Stories by LunaVerse on Medium

When Data Scientist meets Robotics

LunaVerse — Mon, 28 Jul 2025 14:11:24 GMT

The Golden Intersection

Mastering these areas will make you a top-tier candidate.

1. Computer Vision (CV)

This is the #1 most critical skill for you. Robots need to “see.”

Why it’s important: For navigation, object detection, manipulation, and scene understanding.

Skills to Focus On:

Fundamentals: Image processing with OpenCV (filtering, transformations, feature detection like SIFT/ORB).

Deep Learning for CV:

Object Detection: Using models like YOLO, SSD, or Faster R-CNN to find and classify objects in an image/video feed.
Image Segmentation: Classifying every pixel in an image (e.g., to distinguish the road from the sidewalk). Models like U-Net are key.
3D Perception: Working with data from stereo cameras or RGB-D sensors (like Intel RealSense) to understand depth and geometry.
Tools: PyTorch (more popular in research/robotics) or TensorFlow, OpenCV.

2. Reinforcement Learning (RL)

This is how you make robots learn complex tasks on their own.

Why it’s important: For tasks that are difficult to program explicitly, like learning to walk, grasp novel objects, or navigate complex terrains.

Skills to Focus On:

Core Concepts: Understand the Markov Decision Process (MDP), Bellman equations, Value functions, and Policy functions.

Algorithms:

Value-based: Q-Learning, Deep Q-Networks (DQN).
Policy-based: REINFORCE, Actor-Critic methods (A2C, A3C), and state-of-the-art like PPO and SAC.
Tools: Gymnasium (formerly OpenAI Gym) for environments, Stable Baselines3 for implementations, MuJoCo or PyBullet for physics simulation.
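
To make the Bellman equation from the Core Concepts concrete, here is a minimal sketch of value iteration on a made-up 5-state corridor. The environment, the reward of 1 at the rightmost state, and the discount factor of 0.9 are all illustrative assumptions; nothing here comes from Gymnasium or Stable Baselines3.

```python
# Value iteration on a hypothetical 5-state corridor MDP (illustrative only).
N_STATES = 5          # states 0..4; state 4 is the terminal goal
ACTIONS = (-1, +1)    # move left or move right
GAMMA = 0.9           # discount factor (assumed)

def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def q_value(values, state, action):
    """One-step lookahead: immediate reward plus discounted next-state value."""
    next_state, reward = step(state, action)
    return reward + GAMMA * values[next_state]

def value_iteration(tol=1e-9):
    values = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES - 1):   # the terminal state keeps value 0
            # Bellman optimality backup: V(s) = max_a [r + gamma * V(s')]
            best = max(q_value(values, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values

values = value_iteration()
# Greedy policy: in each non-terminal state, pick the action with the best lookahead.
policy = [max(ACTIONS, key=lambda a: q_value(values, s, a)) for s in range(N_STATES - 1)]
print(values)
print(policy)
```

The learned values shrink by a factor of GAMMA for each extra step away from the goal, and the greedy policy is to always move right. Q-Learning and DQN estimate the same quantities from sampled experience instead of a known transition function.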

3. State Estimation & Sensor Fusion

Robots are loaded with sensors (IMUs, GPS, LiDAR, cameras). Data Science techniques are used to make sense of this noisy, multi-modal data.

Why it’s important: To get a reliable estimate of the robot’s state (position, orientation, velocity) from imperfect sensor readings.

Skills to Focus On:

Probabilistic Robotics: This is the core theory.
Kalman Filters: Understand the standard Kalman Filter, Extended Kalman Filter (EKF), and Unscented Kalman Filter (UKF). You should be able to implement a simple one from scratch.
Particle Filters: Another powerful technique for non-Gaussian systems.
Time Series Analysis: Your sensor data is time-series data. Understanding concepts from this field is a huge plus.
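
As a starting point for the "implement a simple one from scratch" advice, here is a minimal sketch of a one-dimensional Kalman filter that estimates a constant position from noisy readings. The true position, the noise level, and the function name kalman_1d are hypothetical choices for illustration.

```python
import random

def kalman_1d(measurements, meas_var, process_var=0.0, x0=0.0, p0=1000.0):
    """Scalar Kalman filter for a (nearly) constant state.

    Returns a list of (estimate, variance) pairs, one per measurement.
    """
    x, p = x0, p0
    history = []
    for z in measurements:
        # Predict: the state model is x_k = x_{k-1}, so only the uncertainty changes.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + meas_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        history.append((x, p))
    return history

random.seed(0)
TRUE_POSITION = 10.0   # hypothetical ground truth
NOISE_STD = 2.0        # hypothetical sensor noise (standard deviation)
readings = [random.gauss(TRUE_POSITION, NOISE_STD) for _ in range(200)]

track = kalman_1d(readings, meas_var=NOISE_STD ** 2)
final_estimate, final_variance = track[-1]
print(round(final_estimate, 2), round(final_variance, 4))
```

The estimate's variance drops with every reading, which is the whole point of fusing many imperfect measurements. The EKF and UKF keep the same predict/update loop but handle nonlinear motion and sensor models.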

4. SLAM (Simultaneous Localization and Mapping)

This is the problem of putting a robot in an unknown environment and having it build a map while simultaneously figuring out its location within that map.

Why it’s important: The fundamental problem for any truly autonomous mobile robot (drones, self-driving cars, warehouse bots).

Skills to Focus On:

Understand the probabilistic nature of SLAM.
Know the difference between different approaches (e.g., Filter-based SLAM, Graph-based SLAM).
Familiarity with visual SLAM (like ORB-SLAM) or LiDAR-based SLAM (like Cartographer).

Core Technical Toolkit You MUST Have

These are the foundational tools to implement the concepts above.

1. Python (Expert Level)

You need to be fluent. Your code will run on robots, so it needs to be efficient and clean.

2. ROS (Robot Operating System)

This is non-negotiable for a robotics role. ROS is the standard framework for writing robot software. It handles communication between different processes (nodes).

You must be comfortable with:

Creating nodes, topics, services, and actions.
Writing and using launch files.
Visualizing data with RViz.
Integrating your Python/C++ code into the ROS ecosystem.

3. Simulators

You can’t always test on a real, expensive robot. Simulators are crucial for development, training RL agents, and safe testing.
Tools: Gazebo (the standard for ROS), NVIDIA Isaac Sim (excellent for realistic rendering and physics), MuJoCo/PyBullet (popular for RL research).

4. C++

While Python is great for high-level logic and ML, high-performance robotics code (like controllers, planners, and perception pipelines) is often written in C++ for speed.
You don’t need to be a C++ wizard, but you should be comfortable reading it and writing basic to intermediate code.

Job Roles to Target

Use these keywords to search:

Perception Engineer
Robotics Software Engineer (AI/ML)
Motion Planning Engineer
Autonomous Systems Engineer
ML Engineer — Robotics
Computer Vision Engineer

Good luck. You’ve chosen a fantastic and future-proof field.

Correlation Does Not Imply Causation

LunaVerse — Mon, 28 Jul 2025 14:04:45 GMT

Definitions:

Correlation: A statistical measure that describes the size and direction of a linear relationship between two or more variables. When one variable changes, the other variable tends to change in a specific direction.

Positive Correlation: As variable A increases, variable B tends to increase (e.g., height and weight).
Negative Correlation: As variable A increases, variable B tends to decrease (e.g., hours of sleep and feelings of tiredness).
The Key Word: “Tends.” Correlation describes a pattern of association, not a rule of consequence.

Causation (or Causality): A relationship where a change in one variable is directly responsible for a change in another variable. This implies a mechanism and a directional link where one event (the cause) brings about the other (the effect).

Example: Pressing the power button on your remote control causes the television to turn on. The link is direct, mechanical, and predictable.

Why Correlation is Not Causation: The Three Possibilities

When you observe a strong correlation between two variables, A and B, there are three main possibilities, only one of which is direct causation.

1. Direct Causation (A causes B, or B causes A):

This is what we often mistakenly assume. For example, increased study time (A) is likely a direct cause of higher exam scores (B).

2. Confounding Variable (A third variable, C, causes both A and B):

This is the most common source of confusion and the most important concept for a data scientist to identify. The hidden “confounder” creates an illusion of a direct relationship.
Classic Example: There is a strong positive correlation between ice cream sales (A) and the number of shark attacks (B).
Fallacy: Eating ice cream causes shark attacks.
Reality: A confounding variable, hot weather (C), causes people to both buy more ice cream and swim in the ocean more often, which leads to more shark encounters. A and B are correlated only because they are both effects of C.

3. Coincidence (The relationship is purely due to random chance):

With vast amounts of data, it’s possible to find variables that are mathematically correlated with no logical connection whatsoever. For instance, the number of films Nicolas Cage appeared in per year might correlate with the number of people who drowned by falling into a pool. This is spurious and meaningless.
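
The confounding case can be simulated in a few lines. This is a sketch with made-up numbers: temperature drives both ice cream sales and shark attacks, and neither affects the other. The coefficients, noise levels, and variable names are assumptions for illustration only.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def partial_corr(r_ab, r_ac, r_bc):
    """Correlation of A and B after removing the linear effect of C."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

random.seed(42)
n_days = 5000
temperature = [random.gauss(25, 5) for _ in range(n_days)]               # C
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]          # A depends only on C
shark_attacks = [0.5 * t + random.gauss(0, 1.5) for t in temperature]    # B depends only on C

r_ab = pearson(ice_cream, shark_attacks)
r_ac = pearson(ice_cream, temperature)
r_bc = pearson(shark_attacks, temperature)
r_ab_given_c = partial_corr(r_ab, r_ac, r_bc)
print(round(r_ab, 2), round(r_ab_given_c, 2))
```

The raw correlation between A and B is strong, but once temperature is accounted for, the partial correlation falls to roughly zero. That is the signature of a confounder.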

The Data Scientist’s Role: Playing Detective with Data

Imagine you are a detective. You arrive at a scene and see two things: a broken window and a baseball on the floor inside.

Correlation: You have observed a correlation. The broken window and the baseball are associated in the same time and place.
Causation: Your immediate hypothesis is causation: the baseball caused the window to break.

The job of a data scientist is to be a good detective: to prove that the baseball really did cause the window to break, and it wasn’t something else. This process of trying to prove cause-and-effect is called Causal Inference.

Identifying correlation is the easy first step (like df.corr() in Python).

It’s like finding the baseball and the broken window. The real work is proving the cause. Here are the methods data scientists use, from the most reliable (“The Gold Standard”) to the more clever “detective work” methods.

Method 1: The Gold Standard — The A/B Test (or Randomized Controlled Trial)

This is the most reliable and powerful way to prove causation. It’s like re-creating the crime under controlled conditions.

The Problem: Maybe a strong wind broke the window and the baseball was already on the floor. The wind is a “confounding variable” — a hidden cause. How can we be sure it was the baseball?

The A/B Test Solution: Let’s design an experiment.

We get 1,000 identical windows.
We randomly divide them into two groups of 500.

Group A (Control Group): We do nothing to these windows.
Group B (Treatment Group): We throw a baseball at each of these windows.

We count how many windows break in each group.

Why it Works: By randomizing, we have “neutralized” all other possible causes. A strong wind, a bird strike, or a manufacturing defect is now equally likely to hit either group. So, the only systematic difference between Group A and Group B is that we threw a baseball at Group B.
The Conclusion: If significantly more windows break in Group B, we can be extremely confident that throwing the baseball caused the windows to break. This is the logic behind every A/B test for a new feature on a website.
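
Here is a small sketch of why randomization works, using the window example. It assumes a hidden "windy location" flag on about 30% of the windows (a made-up confounder) and shows that a random split spreads it almost evenly across both groups.

```python
import random

random.seed(7)

# 1,000 hypothetical windows; about 30% sit in a windy spot (the hidden confounder).
windows = [{"windy": random.random() < 0.3} for _ in range(1000)]

# Random assignment: shuffle, then split down the middle.
random.shuffle(windows)
control, treatment = windows[:500], windows[500:]

def windy_share(group):
    """Fraction of windows in the group that are in a windy spot."""
    return sum(w["windy"] for w in group) / len(group)

imbalance = abs(windy_share(control) - windy_share(treatment))
print(windy_share(control), windy_share(treatment), imbalance)
```

We never had to measure or even know about the wind. Random assignment balances known and unknown confounders alike, up to chance variation that shrinks as the groups grow.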

Method 2: When You Can’t Run an A/B Test (Quasi-Experimental Methods)

Sometimes, running an A/B test is impossible, unethical, or too expensive. You can’t randomly assign some people to smoke cigarettes and others not to. You can’t force a new law on only half a city.

In these cases, you only have observational data (data you collected by just watching the world). You have to be a cleverer detective.

Let’s say we want to know if a new tutoring program (the “treatment”) causes higher test scores. We can’t force kids into it. We just have data on kids who chose to join and those who didn’t.

Clever Detective Trick #1: Regression with Control Variables

The Problem: The kids who joined the tutoring program might already be the most motivated or have parents who are more involved. This “motivation” is a confounding variable. We can’t tell if the program worked or if the kids were already on track to do better.

The Solution: We can try to statistically control for this. When we build a regression model to predict test scores, we don’t just include joined_program (yes/no). We also add other variables (the “controls”) we think might be confounders, like hours_spent_on_homework, parents_education_level, or previous_gpa.
The Logic: By including these variables, the model tries to isolate the effect of the tutoring program while holding the other factors constant. It’s like asking, “For two students with the same motivation and homework habits, what was the effect of one of them joining the program?” It’s not as good as an A/B test, but it’s a way to try and account for obvious confounders.

Clever Detective Trick #2: Difference-in-Differences (DiD)

The Problem: How do we know the scores wouldn’t have gone up anyway? Maybe there was a city-wide trend of improving education.

The Setup: Imagine a nearby city, City B, that is very similar to our city, City A, but it did not launch a tutoring program.

The Solution:

We measure the average test scores in both cities before the program starts in City A.
We measure the average test scores in both cities after the program has been running for a year in City A.

The Logic:

The change in scores in City B represents the “natural trend” (what would have happened anyway).
The change in scores in City A represents the natural trend plus the effect of the tutoring program.
Therefore, the “difference in the differences” between the two cities gives us an estimate of the true causal effect of the program.
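
The arithmetic behind DiD is short enough to sketch directly. The scores below are hypothetical, and the estimate is only trustworthy under the usual "parallel trends" assumption: without the program, City A would have followed the same trend as City B.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Estimated effect = (change in treated group) - (change in control group)."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change

# Hypothetical average test scores (not real data).
effect = diff_in_diff(
    treated_before=70.0, treated_after=78.0,   # City A: +8 points
    control_before=68.0, control_after=71.0,   # City B: +3 points (natural trend)
)
print(effect)  # 5.0
```

City A improved by 8 points, but 3 of those would likely have happened anyway, so the program's estimated effect is 5 points.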

These methods are called “quasi-experimental” because they try to mimic a real experiment using clever statistical techniques when a true experiment isn’t possible.

Statistical tests: z-Test, t-Test, Chi-Squared Test, Correlation test

LunaVerse — Fri, 25 Jul 2025 18:00:32 GMT

This is your practical playbook. To choose the right statistical test, you don’t start with the test name; you start with the business question and identify the number and type of variables involved. This framework maps your data directly to the correct statistical tool.

Part 1: Analysis of a Single Feature

This is for when you want to compare a property of a single group to a known standard, target, or historical benchmark.

Case: The feature is NUMERICAL (e.g., “Average Order Value”)

Business Question: “Our historical average order value (AOV) is $50. After a site-wide redesign, we sampled 100 new orders and found the average was $54. Is this increase statistically significant?”
The Tests: One-Sample t-Test or One-Sample z-Test.

The Critical Choice: z-Test vs. t-Test

This is a classic statistics question. The choice depends on one thing: Do you know the true population standard deviation (σ)?

Use a z-Test if: You know σ. This is rare. It might happen in a controlled manufacturing process or if you have access to the entire population’s data (which is almost never the case).
Use a t-Test if: You do not know σ and have to estimate it from your sample (using the sample standard deviation, s). This is almost always the situation in data science. The t-test accounts for the extra uncertainty of estimating the spread from a sample.
Conclusion: For practical purposes, when you hear “test of a mean,” you should think t-test.

Case: The feature is CATEGORICAL (e.g., “Clicked Ad: Yes/No”)

Test: One-Proportion z-Test.
Business Question: “Historically, 10% of users who get a support email click the feedback link. We re-wrote the email, and in a sample of 1000 users, 120 clicked. Is this new 12% click-rate a significant improvement?”
How it Works: Compares an observed sample proportion to a known or hypothesized population proportion.
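
The support-email question can be worked through with the standard library alone. This is a minimal sketch of a one-sided one-proportion z-test; the function name is my own, and the numbers come from the business question above.

```python
import math
from statistics import NormalDist

def one_proportion_ztest(successes, n, p0):
    """Return (z, one-sided p-value) for H1: true proportion > p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

z, p_value = one_proportion_ztest(successes=120, n=1000, p0=0.10)
print(round(z, 2), round(p_value, 4))
```

Here z is about 2.11 and the one-sided p-value is under 0.05, so at the usual 5% level the new email's click rate is a significant improvement.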

Part 2: Analysis of Two Features (Relationships & Comparisons)

This is for when you want to understand the relationship between two variables.

Case: Both features are CATEGORICAL

Test: Chi-Squared (χ²) Test.

Business Question (Independence): “Is there a relationship between a user’s country (USA, India, UK) and the subscription plan they choose (Basic, Premium)?”

This is the Chi-Squared Test for Independence. It checks if the variables are associated by comparing observed counts in a contingency table to the counts we’d expect if they were independent. This is fundamental for user segmentation.

Business Question (Goodness of Fit): “We expect our support tickets to be distributed 50% for ‘Billing’, 30% for ‘Technical’, and 20% for ‘General’. Our observed counts this week were 110, 50, and 40. Do our observations fit our expectations?”

This is the Chi-Squared Goodness of Fit Test. It compares the observed counts of a single categorical variable against a hypothesized distribution.
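
The support-ticket question can be checked by hand or with a short sketch like the one below. The helper name is hypothetical. The p-value line relies on a convenient fact: with 3 categories there are 2 degrees of freedom, and for exactly 2 degrees of freedom the chi-squared tail probability is exp(-x / 2).

```python
import math

def chi_squared_gof(observed, expected_props):
    """Return (test statistic, expected counts) for a goodness-of-fit test."""
    total = sum(observed)
    expected = [p * total for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

stat, expected = chi_squared_gof([110, 50, 40], [0.5, 0.3, 0.2])

# Closed form valid only for 2 degrees of freedom: P(X > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)
print(expected, round(stat, 3), round(p_value, 3))
```

The statistic is about 2.67 with a p-value near 0.26, so this week's counts are consistent with the expected 50/30/20 split.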

Case: One feature NUMERICAL, one feature CATEGORICAL

This is the heart of A/B testing and group comparisons. The categorical variable defines the groups you want to compare.

Sub-case: The categorical feature has 2 groups (e.g., Control vs. Variant)

Test: Two-Sample Independent t-Test.
Business Question: “Does our new feature (Variant group) lead to a higher average session time (numerical) compared to the old design (Control group)?”
Key Interview Topic — Assumptions: Be ready to discuss the assumptions: (1) Independence of groups, (2) Normality of the numerical data within each group (or large sample size), and (3) Equality of variances. If assumptions fail, mention alternatives like Welch’s t-test (for unequal variances) or the non-parametric Mann-Whitney U test (if data is not normal).

Special Sub-case: The two groups are related (e.g., Before vs. After)

Test: Paired t-Test.
Business Question: “We measured the performance of 20 employees, gave them a new tool, and then measured the performance of the same 20 employees again. Did the tool have an effect?”
Why it’s Different: This test is more powerful because it controls for individual variation. It analyzes the differences for each pair, not the raw values of the groups.
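
To see why pairing helps, here is a sketch that computes both the paired t statistic and an unpaired (Welch) t statistic on the same data. The five employees and their scores are made up for illustration; the scenario above uses 20.

```python
import math
from statistics import mean, stdev, variance

# Hypothetical performance scores for the same five employees.
before = [10, 20, 30, 40, 50]
after = [12, 21, 33, 41, 53]

# Paired t: work with the per-person differences.
diffs = [a - b for a, b in zip(after, before)]
t_paired = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Welch's (independent) t: ignores the pairing entirely.
se_indep = math.sqrt(variance(before) / len(before) + variance(after) / len(after))
t_indep = (mean(after) - mean(before)) / se_indep

print(round(t_paired, 2), round(t_indep, 2))
```

The average improvement is 2 points in both calculations. The paired statistic is about 4.47 while the unpaired one is about 0.20, because the large differences between employees drown out the signal when the pairing is ignored.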

Sub-case: The categorical feature has 3 or more groups (e.g., Gold vs. Silver vs. Bronze tier)

Test: ANOVA (Analysis of Variance).
Business Question: “Do users on our Gold, Silver, or Bronze plans have different average monthly spending?”
Interview Focus: Running multiple t-tests inflates the risk of a false positive. ANOVA tests all groups at once. If it’s significant, your job isn’t done. You must perform post-hoc tests (e.g., Tukey’s test) to find out which specific pairs of groups are different from each other.

Case: Both features are NUMERICAL

Test: Correlation Test (Pearson).
Business Question: “Is there a linear relationship between a user’s age and their monthly spending?”
How it Works: Calculates the correlation coefficient r to measure the strength and direction of a linear relationship. The test provides a p-value to determine if this relationship is statistically significant.

Question: The “Choose Your Design” Scenario: We want to test a new training program. Should we measure 20 people before and after (Design 1), or test 20 people with the training against 20 without (Design 2)?

Answer: Design 1 calls for a paired t-test, while Design 2 uses a two-sample independent t-test. Design 1 is often more statistically powerful because by measuring the same individuals, we control for their baseline abilities, reducing the overall noise in the experiment and making it easier to detect a true effect from the training.

Module 4: Statistical Inference — Making Judgements from Data

LunaVerse — Fri, 25 Jul 2025 17:06:01 GMT

Topic 4.1: The Framework of Inference: Estimation

Before we test a specific claim (hypothesis testing), we often want to simply estimate an unknown value from the population.

Imagine you’re a product manager for a streaming service. The CEO asks a simple question: “What’s the average number of hours our users stream per week?” This true average across all millions of users is the population parameter, and it’s unknown.

Point Estimation: You take a sample of 1,000 users and calculate their average streaming time. The result is 8.2 hours. This single number, 8.2, is your point estimate. It’s your single best guess for the true population average.
Interval Estimation (Confidence Intervals): This is a much more powerful and honest approach. Instead of giving one number, you provide a range of plausible values. You might come back and say, “I am 95% confident that the true average weekly streaming time for our users is between 7.9 and 8.5 hours.”

Question (Confidence Interval Calculation): A data scientist samples 64 users and finds their average daily time on an app is 30 minutes. The population standard deviation (σ) is known to be 8 minutes. Calculate the 95% confidence interval for the true mean daily time on the app. (The Z-value for 95% confidence is 1.96).

Solution: Identify knowns: Sample mean x̄ = 30, population std dev σ = 8, sample size n = 64, Z-value = 1.96.

Calculate Standard Error (SE): SE = σ / √n = 8 / √64 = 8 / 8 = 1.
Calculate Margin of Error (ME): ME = Z * SE = 1.96 * 1 = 1.96.
Construct the Interval: Interval = x̄ ± ME = 30 ± 1.96.
Lower bound = 30 - 1.96 = 28.04.
Upper bound = 30 + 1.96 = 31.96.

The 95% confidence interval is [28.04, 31.96].
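
The same calculation as a short Python sketch, using only the standard library. The function name is my own; the inputs are the ones from the question.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

lower, upper = z_confidence_interval(sample_mean=30, sigma=8, n=64)
print(round(lower, 2), round(upper, 2))  # 28.04 31.96
```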

Interviewer (The “Practical Levers” Question): We’ve just completed a pilot study for a new feature and the 95% confidence interval for the uplift in user engagement is [-0.5%, +7%]. The marketing team is excited about the 7%, but the finance team is worried about the -0.5%. The VP asks you, ‘This range is too wide to make a launch decision. What can we do to get a narrower confidence interval?’

Answer: A wide confidence interval indicates a high degree of uncertainty.

Increase the Sample Size: The width of the confidence interval is inversely proportional to the square root of the sample size. If we want to cut the interval width in half, we need to quadruple our sample size. I would propose running the study on a larger group of users to get a more stable and precise measurement of the feature’s impact.
Decrease the Confidence Level: We could calculate a 90% or 80% confidence interval instead of 95%. This would mathematically produce a narrower range. We’d have a narrower range, but a higher risk (e.g., a 1 in 10 chance instead of 1 in 20) that our interval completely misses the true value. It’s a trade-off between precision and confidence, and for most business decisions, we want to maintain high confidence.
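
Both levers can be checked numerically. The sketch below assumes a known standard deviation of 8 and a baseline sample of 100 (hypothetical values) and compares the interval width after quadrupling the sample and after dropping to 90% confidence.

```python
import math
from statistics import NormalDist

def ci_width(sigma, n, confidence=0.95):
    """Full width of a z-based confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sigma / math.sqrt(n)

SIGMA = 8   # hypothetical population standard deviation
base = ci_width(SIGMA, n=100)
bigger_sample = ci_width(SIGMA, n=400)                        # 4x the sample
lower_confidence = ci_width(SIGMA, n=100, confidence=0.90)    # same n, less confidence
print(round(base, 2), round(bigger_sample, 2), round(lower_confidence, 2))
```

Quadrupling the sample cuts the width exactly in half. Dropping to 90% confidence narrows it by only about 16%, and at the cost of being wrong more often.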

Topic 4.2: The Framework of Inference: Hypothesis Testing

This is the formal, scientific process for using sample data to test a claim about a population.

Imagine you’re a data scientist at a gaming company. The company releases a patch to make a difficult level easier. The goal is to reduce the average completion time from its historical average of 60 seconds. You collect data from a sample of 100 players after the patch and find their new average time is 58 seconds.

Is the level really easier? Or could this 2-second difference just be random chance from the specific 100 players you happened to observe? Hypothesis testing gives us a framework to decide.

The Core Logic: We start by assuming the patch did nothing. This is the Null Hypothesis (H₀).

H₀: The true average completion time is still 60 seconds (μ = 60).

We then state what we are hoping to prove. This is the Alternative Hypothesis (H₁).

H₁: The true average completion time is now less than 60 seconds (μ < 60).

We then calculate the probability of seeing our result (a sample mean of 58s or even less) if the null hypothesis were true. This probability is the famous p-value.

Significance Level (alpha, α): Before we even start, we set a threshold of “unlikeliness,” typically 5% (α = 0.05). This is our standard for rejecting the status quo (H₀).

The Decision:

If the p-value is very small (e.g., p=0.01), it means our result was very unlikely to happen by random chance alone. Since it did happen, we doubt our initial assumption. We reject the Null Hypothesis (H₀) and conclude that the patch likely had an effect.
If the p-value is large (e.g., p=0.30), it means our result was quite plausible under the null hypothesis. We fail to reject the Null Hypothesis (H₀). This doesn’t prove the patch did nothing, but it means we don’t have enough evidence to claim it did.
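
Here is the whole decision as a sketch for the gaming example. The scenario doesn't give a standard deviation, so the value of 10 seconds below is an assumption, and treating it as known lets us use a simple z-test. In practice you would estimate it from the sample and use a t-test.

```python
import math
from statistics import NormalDist

def one_sided_lower_ztest(sample_mean, mu0, sigma, n):
    """Return (z, p-value) for H1: true mean < mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = NormalDist().cdf(z)   # probability of a mean this low or lower under H0
    return z, p_value

ALPHA = 0.05
# sigma=10 is a hypothetical value, not given in the scenario.
z, p_value = one_sided_lower_ztest(sample_mean=58, mu0=60, sigma=10, n=100)
reject_h0 = p_value < ALPHA
print(z, round(p_value, 4), reject_h0)
```

Under that assumption, z is -2.0 and the p-value is about 0.023, which is below α = 0.05, so we would reject H₀. With a larger spread, say 20 seconds, the same 2-second drop would not be significant.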

Type I & Type II Errors (Crucial Interview Topic): This is about the two ways our conclusion can be wrong.

Type I Error (α): A False Positive. We reject H₀ when it was actually true. (We conclude the patch worked, but the 2-second difference was just a random fluke). The probability of this error is our significance level, α.
Type II Error (β): A False Negative. We fail to reject H₀ when it was actually false. (The patch really did work, but our sample size was too small or the effect was too subtle for us to detect it, so we missed it).

Question: In the context of a criminal trial, the Null Hypothesis (H₀) is “the defendant is innocent.” What would constitute a Type I Error in this scenario?

(A) An innocent person is set free.

(B) A guilty person is set free.

(C) An innocent person is convicted.

(D) A guilty person is convicted.

Solution:

A Type I error is rejecting a true Null Hypothesis.
The Null Hypothesis is “the defendant is innocent.”
Rejecting this hypothesis means declaring the defendant “not innocent,” i.e., guilty.
Therefore, a Type I error is convicting a defendant who was, in fact, innocent.

Answer: (C) An innocent person is convicted. (B is a Type II Error.)

Interviewer (The “p-value for a 5-year-old” Question): Imagine I’m a business manager with no statistical background. Explain what a p-value is to me. Use a simple analogy.

Answer: Let’s imagine we’re testing a new ad campaign. The old ad has a known click rate. We run the new ad and see it gets a higher click rate. The p-value answers one simple question:

‘If our new ad is actually just as good as the old one (and the higher rate we saw was pure luck), what’s the chance of seeing a result this good or even better?’

If the p-value is high, say 40%, it means that even if the new ad were no better than the old one, we’d still see a result this good about 40% of the time. That’s far too easy to get by luck, so we’d say we don’t have enough evidence it’s better.
If the p-value is very low, say 1%, it means there’s only a 1% chance we’d see such a good result by luck alone. Because that’s so unlikely, we’d bet that it wasn’t luck — the new ad is genuinely better.

So, a small p-value means ‘probably not luck,’ and a large p-value means ‘could easily be luck.’

Interviewer (The “Error Trade-off” Question): We’re building a machine learning model to detect fraudulent transactions. In this context, what is a Type I error and what is a Type II error? Which one do you think is more costly for the business, and how might that influence the threshold you set for your model?

Answer: Let’s set up the hypothesis:

Null Hypothesis (H₀): The transaction is not fraudulent.
Alternative Hypothesis (H₁): The transaction is fraudulent.

Now let’s define the errors:

Type I Error (False Positive): We reject the null hypothesis when it’s true. This means we flag a legitimate transaction as fraudulent. The consequence is that we block a valid customer’s payment. This is a very bad customer experience and could cause them to abandon their cart or even stop using our service.
Type II Error (False Negative): We fail to reject the null hypothesis when it’s false. This means we let a fraudulent transaction go through as if it were legitimate. The consequence is a direct financial loss for the company, as we have to cover the cost of the chargeback.

Which is more costly? This is the crucial trade-off.

In many e-commerce scenarios, the Type I error (blocking a good customer) is considered more costly. The lifetime value of a lost customer is often far greater than the value of a single fraudulent transaction. Annoying your customers is a major business risk.
In other scenarios, like very high-value B2B transactions, the Type II error (letting fraud through) might be more costly because a single fraudulent event could be worth millions of dollars.

How does this influence the model?

If we decide that Type I errors are more costly, we would set our model’s classification threshold to be more conservative. We would only flag a transaction as fraud if the model is extremely confident. This would reduce the number of false positives but would inevitably increase the number of false negatives (Type II errors). It’s a strategic business decision, and as data scientists, our job is to present this trade-off clearly to stakeholders using tools like a precision-recall curve, so they can make an informed choice.
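
The trade-off is easy to see on a toy example. The ten scored transactions below are invented for illustration; the sketch counts both error types at three different thresholds.

```python
# Hypothetical (fraud_score, is_fraud) pairs; not real data.
scored = [
    (0.05, False), (0.10, False), (0.20, False), (0.35, False),
    (0.45, True),  (0.55, False), (0.60, True),  (0.70, False),
    (0.80, True),  (0.95, True),
]

def errors_at(threshold):
    """Return (false_positives, false_negatives) when flagging score >= threshold."""
    fp = sum(1 for score, fraud in scored if score >= threshold and not fraud)
    fn = sum(1 for score, fraud in scored if score < threshold and fraud)
    return fp, fn

for threshold in (0.4, 0.65, 0.9):
    print(threshold, errors_at(threshold))
# 0.4  -> (2, 0)
# 0.65 -> (1, 2)
# 0.9  -> (0, 3)
```

Raising the threshold removes false positives one by one, but each step lets more fraud through. Which point on that curve is right depends on the relative cost of the two errors.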

Module 3: The Bridge from Probability to Inference

LunaVerse — Fri, 25 Jul 2025 16:05:35 GMT

Topic 3.1: Foundational Theorems

Law of Large Numbers (LLN): It formally states what we already suspect: as you collect more and more data points from a random process, the average of your sample will get closer and closer to the true, theoretical average (the expected value) of the entire population.
Central Limit Theorem: It states that if you take sufficiently large random samples from a population — regardless of the population’s original distribution (it can be skewed, bimodal, uniform, anything!) — the distribution of the sample means will be approximately Normal (a bell curve).
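
A quick simulation makes the CLT tangible. This sketch draws many samples from a strongly right-skewed exponential population (mean 1, standard deviation 1) and looks at how the sample means behave. The sample size of 50 and the 2,000 repetitions are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(1)

SAMPLE_SIZE = 50
N_SAMPLES = 2000

# Exponential population with rate 1: mean 1, standard deviation 1, right-skewed.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

center = mean(sample_means)
spread = stdev(sample_means)
# Share of sample means within one standard deviation of the center
# (about 68% if the distribution is roughly Normal).
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / N_SAMPLES
print(round(center, 3), round(spread, 3), round(within_one_sd, 3))
```

The sample means cluster around 1 with a spread close to 1 / √50 ≈ 0.141, and roughly 68% of them fall within one standard deviation of the center, just as a bell curve predicts, even though the underlying data is far from Normal.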

Question: Which of the following statements is a direct consequence of the Central Limit Theorem?

(A) The more data you collect, the closer your sample mean will be to the population mean.

Option A describes the Law of Large Numbers.

(B) The distribution of a sample of size n=1,000,000 from a right-skewed population will be approximately Normal.

Option B is incorrect. The sample itself will still reflect the skewed shape of the population.

Option C is the correct definition of the Central Limit Theorem. It’s about the distribution of the statistic (the sample mean), not the sample itself.

(D) The variance of the sample mean decreases as the sample size increases.

Option D is a true statement related to the sampling distribution, but it’s not the core statement of the CLT.

Question: The individual weights of a species of fish are known to follow a highly skewed distribution with a population mean (μ) of 5 kg and a population standard deviation (σ) of 2 kg. If we take a random sample of 100 fish, what is the approximate probability that the average weight of the sample is less than 4.8 kg?

Solution: Even though the original distribution is skewed, the CLT tells us the distribution of the sample mean (x̄) will be approximately Normal because the sample size (n=100) is large.

The mean of this new sampling distribution is the same as the population mean: μ_x̄ = μ = 5 kg.
The standard deviation of the sampling distribution (called the Standard Error) is σ_x̄ = σ / √n = 2 / √100 = 2 / 10 = 0.2 kg.
Now we have a standard Normal distribution problem. We need to find P(x̄ < 4.8).
Convert 4.8 to a Z-score: Z = (x̄ - μ_x̄) / σ_x̄ = (4.8 - 5) / 0.2 = -0.2 / 0.2 = -1.0.
We need to find P(Z < -1.0). From a standard normal table, this value is approximately 0.1587.

The approximate probability is 15.87%.
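
The table lookup can be replaced by a few lines of Python. This sketch uses the standard library's NormalDist with the numbers from the question.

```python
import math
from statistics import NormalDist

mu, sigma, n = 5.0, 2.0, 100
standard_error = sigma / math.sqrt(n)                    # 0.2 kg
sampling_dist = NormalDist(mu=mu, sigma=standard_error)  # distribution of the sample mean
prob = sampling_dist.cdf(4.8)
print(round(prob, 4))  # 0.1587
```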

Topic 3.2: Sampling Distribution

Imagine you’re a data scientist at a fintech company, and you want to know the average monthly spending of your entire user base (the population). It’s impossible to ask everyone. So, you take a sample of 1,000 users and find their average spending is $450. But what if you had, by chance, picked a different sample of 1,000 users? Their average might have been $445. A third sample might yield $458.

A sampling distribution is the theoretical probability distribution of a statistic (like the sample mean) that would result from drawing all possible samples of a given size from a population. The histogram of all these possible sample means is the sampling distribution of the mean.

Thanks to the CLT, we know this distribution is approximately Normal.

Key Sampling Distributions

t-Distribution: The realistic cousin of the Normal distribution. We use it when we test hypotheses about a mean but we have a small sample size (typically n < 30) AND/OR we do not know the true population standard deviation (σ). The t-distribution looks like a Normal distribution but has “fatter tails.” This accounts for the extra uncertainty we have because we’re estimating the population’s spread from a small sample.

The ‘fatter tails’ mean that the t-distribution acknowledges a higher probability of extreme values occurring, which is a direct result of the extra uncertainty we have from a small sample.

Chi-squared (χ²) Distribution: This distribution is not for testing means. It’s the workhorse for tests involving categorical data or population variance. It answers questions like: “Is the distribution of new sign-ups across different marketing channels (Google, Facebook, Direct) what we expected?” (Goodness of Fit test) or “Is there a relationship between a user’s country and the product tier they choose?” (Test for Independence).

Question: A data analyst takes a sample of 25 students to estimate their average daily screen time. The population standard deviation is unknown. To calculate a confidence interval for the population mean screen time, which distribution should be used as the basis for the critical value?

(A) Standard Normal (Z) distribution

(B) t-distribution

(C) Chi-squared (χ²) distribution

(D) Binomial distribution

Solution: The answer is (B). The two key phrases are “sample of 25” (which is small, n < 30) and “population standard deviation is unknown.” These are the classic conditions that require the use of the t-distribution instead of the Z-distribution. Chi-squared is for categorical data and variances, and Binomial is for counts of successes.

Question: A company expects its customer support tickets to be distributed evenly across four categories: A, B, C, D. In a given week, they receive 100 tickets, with the following observed counts: A=30, B=25, C=25, D=20. Calculate the Chi-squared (χ²) test statistic for this data.

Solution:

Expected Counts: With 100 tickets and 4 even categories, the expected count for each is E = 100 / 4 = 25.

Chi-squared Formula: χ² = Σ [(Observed - Expected)² / Expected]

Calculate for each category:

A: (30 - 25)² / 25 = 5² / 25 = 25 / 25 = 1.0
B: (25 - 25)² / 25 = 0² / 25 = 0
C: (25 - 25)² / 25 = 0² / 25 = 0
D: (20 - 25)² / 25 = (-5)² / 25 = 25 / 25 = 1.0

Sum them up: χ² = 1.0 + 0 + 0 + 1.0 = 2.0

The Chi-squared test statistic is 2.0
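
The same arithmetic can be checked in a few lines of Python. This is a minimal sketch using only the numbers from the question; the variable names are illustrative.

```python
# Observed ticket counts from the question above.
observed = {"A": 30, "B": 25, "C": 25, "D": 20}

# Under the "evenly distributed" hypothesis, each category expects total / k tickets.
total = sum(observed.values())
expected = total / len(observed)

# Chi-squared statistic: sum of (Observed - Expected)^2 / Expected over all categories.
chi_squared = sum((o - expected) ** 2 / expected for o in observed.values())

print(chi_squared)  # 2.0
```

With 4 categories the test has 3 degrees of freedom, and the usual 5% critical value is about 7.81, so a statistic of 2.0 would not be considered significant.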

Interviewer: We just launched a new app feature and collected two types of data:

The average session duration for users who engaged with the feature.
A count of users categorized by their final action: ‘Used feature and churned,’ ‘Used feature and retained,’ ‘Ignored feature and churned,’ ‘Ignored feature and retained.’

We want to test for statistical significance in both cases. Which sampling distribution — t-distribution or Chi-squared — would you use for each analysis, and why?

Answer:

For the average session duration, I would use a framework based on the t-distribution. My goal is to compare the mean session duration of two groups (e.g., users with the feature vs. users without). Since this involves testing a hypothesis about a mean and I’d be working with sample data where the true population standard deviation is unknown, a two-sample t-test, which relies on the t-distribution, is the appropriate tool.
For the count of users by action category, I would use the Chi-squared distribution. Here, the data is categorical, not a mean value. The question I’m trying to answer is whether there is a statistically significant association between using the feature and the outcome of churning. A Chi-squared test for independence is designed precisely for this: to compare observed frequencies in a contingency table (like the one described) to the frequencies we would expect if there were no relationship between the two variables. The resulting test statistic follows a Chi-squared distribution.

Module 2: Random Variables & Their Distributions (Part 2)

LunaVerse — Thu, 24 Jul 2025 16:09:40 GMT

Topic 2.3: Continuous Random Variables and Distributions

If discrete distributions are about counting, continuous distributions are about measuring. They are essential for modeling real-world quantities such as time and money.

Probability Density Function (PDF): Unlike a PMF, the value of the PDF at a specific point f(x) is not a probability. It represents a density. To get the probability, you must find the area under the curve of the PDF over some interval. This leads to the most important and often confusing rule for continuous RVs: P(X=x) = 0.

The probability of a user streaming for exactly 2.00000… hours is zero, because there are infinite possible values. We can only talk about the probability of them streaming between, say, 1.9 and 2.1 hours.

Cumulative Distribution Function (CDF): The CDF is again F(x) = P(X ≤ x). For a continuous variable, it’s the total area under the PDF from the beginning up to the point x. It’s a smooth, non-decreasing curve from 0 to 1.

Key Distributions:

Uniform (Continuous): The “Complete Uncertainty” model. Any value within a range [a, b] is equally likely.

Exponential: The “Time-Until-Next-Event” specialist. This is the continuous cousin of the Poisson distribution. If events arrive at a constant average rate (a Poisson process), the waiting time between those events follows an Exponential distribution. It is “memoryless”.

Shape: continuous and right-skewed.
CDF: Let T be the waiting time. The CDF of an Exponential distribution is F(t) = P(T ≤ t) = 1 - e^(-λt).
Parameters: a single parameter λ (lambda) representing the rate, with a mean of 1/λ and a variance of 1/λ².
Memoryless property: The probability of an event happening in the future is not influenced by how long it has already been since the last event. For example, if a light bulb is still working after 1000 hours, the probability it will fail in the next hour is the same as the probability of it failing in the first hour.
The Poisson distribution models the number of events in a fixed interval, while the exponential distribution models the time between those events.
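
The memoryless property can be checked numerically. The sketch below uses a hypothetical rate and hypothetical waiting times (none of these numbers come from the text) and compares P(T > s + t | T > s) with P(T > t).

```python
import math


def survival(t, rate):
    """P(T > t) for an Exponential waiting time with the given rate."""
    return math.exp(-rate * t)


rate = 0.5        # hypothetical rate: 0.5 events per hour
s, t = 3.0, 2.0   # already waited s hours; ask about t more hours

# Conditional probability: P(T > s + t | T > s) = P(T > s + t) / P(T > s)
conditional = survival(s + t, rate) / survival(s, rate)

# Unconditional probability of waiting more than t hours from scratch.
unconditional = survival(t, rate)

print(math.isclose(conditional, unconditional))  # True
```

The two values agree for any choice of s and t, which is exactly what “memoryless” means.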

Normal (Gaussian): A symmetrical, bell-shaped curve. The mean, median, and mode are equal and sit at the center of the distribution, which makes it unimodal (only one peak).

The distribution is also asymptotic, meaning its tails approach the x-axis but never touch it.
The standard deviation (σ) measures the spread of the data around the mean. By the empirical rule (also known as the 68–95–99.7 rule), about 68% of values fall within 1σ of the mean, about 95% within 2σ, and about 99.7% within 3σ.
A normal distribution has a central tendency; the mean represents the most likely value, and values deviate from it with decreasing probability.
It models the distribution of sample means due to the Central Limit Theorem.
This theorem focuses on the distribution of sample means. It says that if you take many sufficiently large samples of the same size from a population, calculate the mean of each sample, and then plot those sample means, the resulting distribution will be approximately normal, whatever the shape of the original population.
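
A small simulation makes the Central Limit Theorem concrete. This is only a sketch: the population (Exponential with rate 1, which is strongly right-skewed), the sample size, and the number of samples are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # observations per sample (assumed)
NUM_SAMPLES = 2000   # number of samples drawn (assumed)

# Population: Exponential with rate 1, so its mean is 1 and its standard deviation is 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

# The sample means should center on the population mean (1.0) with a
# spread close to sigma / sqrt(n) = 1 / sqrt(50), roughly 0.14.
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)

print(round(mean_of_means, 2))
print(round(spread_of_means, 2))
```

Plotting a histogram of sample_means would show a roughly bell-shaped curve even though the individual observations are skewed.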

Standard Normal (Z-distribution): The “Universal Translator”. It’s a special Normal distribution with μ=0 and σ=1. We use it to standardize any normal variable into a “Z-score”, allowing us to compare different scales.

Z-score formula: Z = (X - μ) / σ

Question (Exponential): Customer support calls arrive at a call center according to a Poisson process with an average rate of 10 calls per hour. What is the probability that the waiting time until the next call is more than 15 minutes?

Solution: The waiting time between events in a Poisson process follows an Exponential distribution.

First, align the units. The rate λ is 10 calls/hour. The question is in minutes. λ = 10 calls / 60 minutes = 1/6 calls per minute.
Let T be the waiting time. The CDF of an Exponential distribution is F(t) = P(T ≤ t) = 1 - e^(-λt).
The question asks for P(T > 15). This is the complement of P(T ≤ 15).
P(T > 15) = 1 - P(T ≤ 15) = 1 - [1 - e^(-(1/6) * 15)] = e^(-15/6) = e^(-2.5).
e^(-2.5) ≈ 0.082.

There is approximately an 8.2% chance that the wait for the next call will be longer than 15 minutes.
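
The calculation is a one-liner once the units are aligned. A minimal sketch with illustrative variable names:

```python
import math

rate_per_minute = 10 / 60   # 10 calls per hour, expressed per minute
wait_minutes = 15

# Survival function of the Exponential distribution: P(T > t) = e^(-rate * t)
p_wait_longer = math.exp(-rate_per_minute * wait_minutes)

print(round(p_wait_longer, 3))  # 0.082
```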

Question: (Normal) The daily data processing time for a job on a server is normally distributed with a mean(μ) of 120 minutes and a standard deviation (σ) of 15 minutes. What is the probability that a job will take between 90 and 135 minutes?

Solution: We need to convert the values 90 and 135 into standard Z-scores.

Z-score formula: Z = (X - μ) / σ
Z₁ (for X=90): Z₁ = (90 - 120) / 15 = -30 / 15 = -2.0.
Z₂ (for X=135): Z₂ = (135 - 120) / 15 = 15 / 15 = 1.0.
The question is now P(-2.0 ≤ Z ≤ 1.0).
This is calculated as P(Z ≤ 1.0) - P(Z ≤ -2.0). Using a standard normal table or calculator:

P(Z ≤ 1.0) ≈ 0.8413
P(Z ≤ -2.0) ≈ 0.0228

P(-2.0 ≤ Z ≤ 1.0) ≈ 0.8413 - 0.0228 = 0.8185.

There is an 81.85% probability that the job will complete between 90 and 135 minutes.
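
Instead of a Z-table, Python’s standard library can evaluate the Normal CDF directly. A minimal sketch using statistics.NormalDist with the mean and standard deviation from the question:

```python
from statistics import NormalDist

# Processing time ~ Normal(mean = 120 minutes, standard deviation = 15 minutes)
processing_time = NormalDist(mu=120, sigma=15)

# P(90 <= X <= 135) = F(135) - F(90)
probability = processing_time.cdf(135) - processing_time.cdf(90)

print(round(probability, 4))  # 0.8186
```

The result differs from the table-based 0.8185 only in the last digit, because the table values are themselves rounded to four decimals before subtracting.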

Interviewer: “Why is Normal so Famous?”

Answer: The reason is the Central Limit Theorem. The theorem states that if you take sufficiently large random samples from any population, regardless of its original shape, the distribution of the sample means will be approximately normal.

This is incredibly powerful. It means even if our raw user spending data is skewed, we can still use techniques like t-tests, which rely on normality, to analyze the average spending of different user groups. It’s the distribution of the statistic (like the mean), not necessarily the raw data, that makes the Normal distribution so ubiquitous in inference.

Topic 2.4: Properties of Random Variables

Let’s use two random variables from an e-commerce context:

X: The number of items a customer buys in a month (a discrete RV).
Y: The total amount of time (in minutes) they spend on the website in a month (a continuous RV).

Expectation (Mean, E[X]): This is the long-run average value of the random variable if we were to repeat the experiment an infinite number of times. It’s the “center of mass” of the distribution. It answers the question, “What value do we expect on average?”

Example: If E[X] = 3.2, it means that over thousands of customers, the average number of items purchased per customer per month is 3.2. Note that no single customer can buy 3.2 items, but the average can be a non-integer.

Variance (Var(X) or σ²): This measures the spread or dispersion of the distribution around its mean. A low variance means most values are clustered tightly around the average. A high variance means the values are very spread out. The standard deviation (σ) is the square root of the variance, bringing the units back to the original scale (e.g., items, not items²).

Example: If two customer segments both have an E[X] = 3.2, but Segment A has a Var(X) = 1 and Segment B has a Var(X) = 20, it tells us Segment A is very predictable (most buy 2–4 items), while Segment B is unpredictable (some buy 0, some buy 20). This is crucial for inventory management.

Covariance (Cov(X, Y)): This measures the joint variability of two random variables. It tells us the direction of the linear relationship.

Positive Covariance: When X tends to be above its mean, Y also tends to be above its mean (e.g., as items purchased goes up, time on site goes up).
Negative Covariance: When X tends to be above its mean, Y tends to be below its mean.
Zero Covariance: No linear relationship.
The problem with covariance is that its value (e.g., 250.7) is hard to interpret. Is that a strong or weak relationship? This leads us to…

Correlation (Corr(X, Y) or ρ): This is the normalized version of covariance. It scales the value to be between -1 and +1, making it universally interpretable.

+1: Perfect positive linear relationship.
-1: Perfect negative linear relationship.
0: No linear relationship.
By dividing the covariance by the product of the standard deviations, it normalizes the metric.
Example: If Corr(X, Y) = 0.8, it signifies a strong positive linear relationship. This is a very useful feature for a recommendation engine — people who buy more also tend to browse more, so we can target them with more recommendations.

Question (Expectation & Variance): A discrete random variable X has the following probability distribution: P(X=0) = 0.2, P(X=1) = 0.5, P(X=2) = 0.3. Calculate the expectation E[X] and the variance Var(X).

Solution:

Expectation E[X]: E[X] = Σ [x * P(X=x)]

E[X] = (0 * 0.2) + (1 * 0.5) + (2 * 0.3) = 0 + 0.5 + 0.6 = 1.1

Variance Var(X): Var(X) = E[X²] - (E[X])²

First find E[X²]: E[X²] = Σ [x² * P(X=x)]

E[X²] = (0² * 0.2) + (1² * 0.5) + (2² * 0.3) = 0 + (1 * 0.5) + (4 * 0.3) = 0.5 + 1.2 = 1.7

Var(X) = 1.7 - (1.1)² = 1.7 - 1.21 = 0.49
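
The same two formulas translate directly into code. A minimal sketch that stores the PMF as a dictionary (an illustrative choice, not the only way to represent it):

```python
# PMF from the question: value -> probability
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

# E[X] = sum of x * P(X = x)
expectation = sum(x * p for x, p in pmf.items())

# E[X^2] = sum of x^2 * P(X = x)
second_moment = sum(x**2 * p for x, p in pmf.items())

# Var(X) = E[X^2] - (E[X])^2
variance = second_moment - expectation**2

print(round(expectation, 2), round(variance, 2))  # 1.1 0.49
```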

Question: (Correlation): Given two random variables X and Y. The variance of X is 16, the variance of Y is 25, and the covariance between X and Y is -10. What is the correlation coefficient ρ(X, Y)?

Solution: formula for correlation: ρ(X, Y) = Cov(X, Y) / (σ_X * σ_Y)

The standard deviations are:

σ_X = √Var(X) = √16 = 4
σ_Y = √Var(Y) = √25 = 5

ρ(X, Y) = -10 / (4 * 5) = -10 / 20 = -0.5

The correlation coefficient is -0.5
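
As a quick check, here is the formula as a small helper function. The function name and signature are illustrative.

```python
import math


def correlation(cov_xy, var_x, var_y):
    """Correlation = covariance divided by the product of the standard deviations."""
    return cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))


rho = correlation(cov_xy=-10, var_x=16, var_y=25)

print(rho)  # -0.5
```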

Interviewer: (“Feature Selection” ) We are building a machine learning model to predict house prices. You find two features, number_of_bathrooms and square_footage, that have a very high correlation, say 0.9. How does this finding impact how you would treat these features before feeding them into a model like Linear Regression?

Answer: A correlation of 0.9 indicates strong multicollinearity. This is a problem for models like Linear Regression because it becomes difficult for the model to disentangle the individual effect of each feature on the house price. The model coefficients can become unstable and have high standard errors, making them hard to interpret.

I would not include both highly correlated features in the model. I would choose one of them based on a few criteria:

Correlation with Target: I’d keep the feature that has a slightly higher correlation with the target variable, price.
Completeness: I’d check which feature has fewer missing values.
Business Sense: Which feature is more fundamental? square_footage is likely a more direct driver of price than number_of_bathrooms.

Alternatively, I could use a dimensionality reduction technique like Principal Component Analysis (PCA) to combine the information from both features into a new, single component. For tree-based models like Random Forest or Gradient Boosting, multicollinearity is less of an issue for predictive accuracy, but removing one feature can still make the model simpler and faster to train.

Interviewer: (“Correlation vs. Covariance”) Why do data scientists almost always report correlation but rarely report covariance? If covariance tells you the direction of a relationship, isn’t that useful enough?

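Answer: While covariance does tell us the direction of the relationship (positive or negative), its magnitude is not standardized, making it almost impossible to interpret or compare. The value depends on the units of the two variables, so measuring time in hours instead of minutes changes the covariance without changing the relationship. Correlation fixes this by dividing the covariance by the product of the standard deviations. It provides a universal, unitless measure of the strength of the linear relationship, bounded between -1 and +1. This makes it easy to say ‘a correlation of 0.8 is stronger than 0.4’, a comparison you simply can’t make with raw covariance values.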

Module 2: Random Variables & Their Distributions (Part 1)

LunaVerse — Thu, 24 Jul 2025 15:47:46 GMT

We learn to model uncertainty itself using mathematical functions, which is the core of statistical modeling.

Topic 2.1: Introduction to Random Variables (RVs)

A random variable is not a traditional variable like x=5. It’s a function that maps the outcomes of a random event to numeric values. This lets us apply the power of mathematics to phenomena that are inherently uncertain.

Discrete Random Variable: The variable can only take on a countable number of distinct values. Example: count of students in a class, X ∈ {0, 1, 2, …}.
Continuous Random Variable: The variable can take on any value within a given range. Example: height of students in a class, where Y can be any real number in an interval such as [150, 170].

Question: From the following list, identify which can be modeled by a discrete random variable and which by a continuous random variable:

The number of defective items in a shipment of 50. — Discrete
The weight of a grain of rice. — Continuous
The number of cars that pass through an intersection in an hour — Discrete
The exact temperature of a data center’s server room. — Continuous
The stock price of a company. — Continuous

Question: An experiment consists of flipping a fair coin twice. Let the random variable Z be the number of Heads observed. What are the possible numerical values of Z, and what is the probability P(Z=1)?

Solution: Sample space S = {HH, HT, TH, TT}

The Random variable Z maps these outcomes to numbers:

HH → Z=2, HT → Z=1, TH → Z=1, TT → Z=0

Possible values of Z are {0,1,2}.

Therefore, P(Z=1) = 2/4 = 0.5
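
The mapping from outcomes to values of Z can be enumerated in code. This sketch assumes a fair coin, so all four outcomes are equally likely, and uses Fraction to keep the probabilities exact.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two coin flips: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# The random variable Z maps each outcome to its number of Heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Each outcome is equally likely, so P(Z = z) = (outcomes mapping to z) / (all outcomes).
pmf = {z: Fraction(count, len(outcomes)) for z, count in counts.items()}

print(sorted(pmf))  # [0, 1, 2]
print(pmf[1])       # 1/2
```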

Interviewer: In our user database, we have a column called did_churn which is either TRUE or FALSE. Why do we need to go through the trouble of defining a ‘random variable’ for this? Why can’t we just work with the TRUE/FALSE labels directly? What practical advantage does defining it as a random variable give us?

Answer: By defining a random variable, say X, where X=1 if the user churned and X=0 if they didn’t, we translate a categorical outcome into a number. This translation is what unlocks the entire toolkit of statistics and mathematics.

Topic 2.2: Discrete Random Variables and Distributions

Let’s take an example: a user is shown 3 different pop-up ads. Let X be the discrete random variable representing the number of ads they click. The possible values for X are {0,1,2,3}.

Probability Mass Function (PMF): This function gives you the probability of the RV being equal to a specific value. It’s the height of the bar in a bar chart for that value. We write it as P(X = k).

The PMF tells us that:
P(X = 0) = 0.5 (50% chance they click no ads)
P(X = 1) = 0.3 (30% chance they click exactly one)
P(X = 2) = 0.15 (15% chance they click exactly two)
P(X = 3) = 0.05 (5% chance they click all three)

Note — The sum of all PMF values must be 1.

Cumulative Distribution Function (CDF): This function gives you the probability of the RV being less than or equal to a specific value. It “accumulates” probability as you go. We write it as F(k) = P(X≤ k).

Using PMF above :
F(0) = P(X ≤ 0) = P(X=0) = 0.5
F(1) = P(X ≤ 1) = P(X=0) + P(X=1) = 0.5 + 0.3 = 0.8
F(2) = P(X ≤ 2) = P(X=0) + P(X=1) + P(X=2) = 0.8 + 0.15 = 0.95
F(3) = P(X ≤ 3) = P(X=0) + P(X=1) + P(X=2) + P(X=3) = 0.95 + 0.05 = 1

The CDF is always non-decreasing and goes from 0 to 1. It looks like a staircase for discrete variables.
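
Building the CDF from the PMF is just a running sum. A minimal sketch using the ad-click PMF values above:

```python
from itertools import accumulate

# PMF for the number of ads clicked: value -> probability
pmf = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}

# The CDF at k is the running total of the PMF up to and including k.
cdf = dict(zip(pmf, accumulate(pmf.values())))

# Round for display only; floating-point sums can be off in the last digits.
print({k: round(v, 2) for k, v in cdf.items()})  # {0: 0.5, 1: 0.8, 2: 0.95, 3: 1.0}
```

This relies on the dictionary being written in increasing order of k, which is how the staircase shape arises.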

Key Distributions

Bernoulli: The “Yes/No” specialist. Models a single trial with two outcomes. It’s the fundamental building block.

Two outcomes: success with probability p and failure with probability 1-p.
The mean (expected value) of a Bernoulli distribution is equal to ‘p’. The variance is p(1-p).
Foundation for binomial distribution

Binomial: The “Success Counter in N Trials”. Models the number of successes in a fixed number of independent Bernoulli trials.

Each trial results in either success or failure.
Constant Probability of Success: The probability of success (p) remains the same for every trial.

PMF: P(X = k) = nCk * p^k * (1-p)^(n-k), where

p is the probability of success in a single trial
(1-p) is the probability of failure
n is the number of trials
k is the number of successes

Poisson: The “Event Rate Modeler”. Models the number of events occurring in a fixed interval of time or space, when you know the average rate. (e.g., number of customers arriving at a store per hour.)

Assumption: the mean is equal to the variance (both equal λ).
The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events.

PMF: P(X = x) = (λ^x * e^-λ) / x!, where
x = 0, 1, 2, 3, …
λ = mean number of occurrences in the interval
e = Euler’s number ≈ 2.71828

Uniform (Discrete): The “Fairness” model. Every outcome is equally likely (e.g., rolling a fair die), so with n possible outcomes the PMF is P(X = k) = 1/n. Also known as the rectangular distribution.

The uniform distribution can be either discrete or continuous. The formulas below are for the continuous version on [a, b], whose PDF is a horizontal line within the range.
PDF: f(x) = 1 / (b - a) for a ≤ x ≤ b
Cumulative Distribution Function (CDF): F(x) = (x - a) / (b - a) for a ≤ x ≤ b
A uniform distribution has no central tendency; every value within the range is equally probable.

Question (Binomial): A company runs an A/B test on a new “Buy now” button. Historically, the old button has a click-through rate of 10%. They show the new button to 20 users. Assuming the new button has the same 10% effectiveness, what is the probability that exactly 3 users click it?

Solution: It’s a binomial problem because we have a fixed number of trials (n=20) and we are counting successes (clicks).

n=20, p=0.10(probability of success), k =3(number of successes we want).

P(X=k) = nCk * p^k * (1-p)^(n-k)
P(X=3) = ²⁰C₃ * (0.1)³ * (0.9)¹⁷
²⁰C₃ = (20 * 19 * 18) / (3 * 2 * 1) = 1140
P(X=3) = 1140 * 0.001 * 0.16677… ≈ 0.1901

There is approximately a 19.01% chance that exactly 3 users click the new button.
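
The binomial PMF is easy to compute with math.comb. A minimal sketch; the function name is illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials, each with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


p_three_clicks = binomial_pmf(k=3, n=20, p=0.10)

print(round(p_three_clicks, 4))  # 0.1901
```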

Question: (Poisson) A data center’s support desk receives requests at an average rate of 4 requests per hour. What is the probability that they will receive zero requests in the next hour?

Solution: It’s a Poisson problem because we are modeling the number of events in a fixed interval (1 hour) given an average rate.

λ (lambda) = 4 requests/hour. We want to find the probability of k=0 events.

Poisson PMF formula: P(X=k) = (λ^k * e^-λ) / k!
P(X=0) = (4⁰ * e⁻⁴) / 0!
P(X=0) = (1 * e⁻⁴) / 1 = e⁻⁴ ≈ 0.0183

There is approximately a 1.83% chance of receiving no support requests in the next hour.
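
The Poisson PMF formula translates to code almost symbol for symbol. A minimal sketch; the function name is illustrative.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) when events occur at an average rate of lam per interval."""
    return lam**k * math.exp(-lam) / math.factorial(k)


p_no_requests = poisson_pmf(k=0, lam=4)

print(round(p_no_requests, 4))  # 0.0183
```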

Interviewer: (Choose your model) Which discrete distribution would you choose as a starting point?

a. The number of users out of a batch of 500 who will unsubscribe after receiving a promotional email.

Binomial distribution — because it models the number of successes(unsubscribes) in a fixed number of trials (500 users).

b. Whether a single, high-value transaction is fraudulent or not.

Bernoulli distribution because it’s a single event with only two possible outcomes: fraudulent or non-fraudulent.

c. The number of typos a content writer makes per article

Poisson distribution because it’s designed to model the count of events(typos) occurring within a fixed interval or space (in this case, one article)

Module 1: The foundations — Describing Data and Basic probability

LunaVerse — Thu, 24 Jul 2025 15:36:57 GMT

Topic 1.1: Counting (Permutations and Combinations)

Counting is about figuring out how many different ways you can select or arrange items.

Permutation: An arrangement of items where order matters.

Example: You have three friends: Alice (A), Bob (B), and Charles (C). How many ways can they finish 1st and 2nd in a race? The pairs could be (A, B), (B, A), (A, C), (C, A), (B, C), (C, B). Notice that (A, B) is different from (B, A) because the order of finishing matters.
Formula: The number of permutations of selecting k items from a set of n is P(n, k) = n! / (n-k)!

Combination: A selection of items from a set where order does NOT matter.

Example: From the same three friends (A, B, C), how many ways can you choose a team of two to go on a trip? The team could be {A, B}, {A, C}, or {B, C}. The team {A, B} is the same as {B, A}. We just care about who is on the team, not the order they were picked in.
Formula: The number of combinations of selecting k items from a set of n is C(n, k) = n! / (k! * (n-k)!)
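
Both examples can be enumerated with itertools and cross-checked against the formulas. A minimal sketch for the three friends:

```python
import math
from itertools import combinations, permutations

friends = ["A", "B", "C"]

# Order matters: (A, B) and (B, A) are different 1st/2nd place finishes.
finishing_orders = list(permutations(friends, 2))

# Order does not matter: {A, B} and {B, A} are the same team.
teams = list(combinations(friends, 2))

print(len(finishing_orders), math.perm(3, 2))  # 6 6
print(len(teams), math.comb(3, 2))             # 3 3
```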

Question: A data science team is being formed from a group of 6 Python programmers and 4 R programmers. The team must consist of 5 members. What is the number of ways the team can be formed if it must have at least 3 Python programmers?

Solution: Combination problem — since order of selecting team members doesn’t matter.

Scenario 1: 3 Python programmers and 2 R programmers
Ways to choose 3 Python from 6: C(6, 3) = 6! / (3! * 3!) = (6*5*4)/(3*2*1) = 20
Ways to choose 2 R from 4: C(4, 2) = 4! / (2! * 2!) = (4*3)/2 = 6
Total ways for Scenario 1 = 20 * 6 = 120
Scenario 2: 4 Python programmers and 1 R programmer
Ways to choose 4 Python from 6: C(6, 4) = 6! / (4! * 2!) = (6*5)/2 = 15
Ways to choose 1 R from 4: C(4, 1) = 4! / (1! * 3!) = 4
Total ways for Scenario 2 = 15 * 4 = 60
Scenario 3: 5 Python programmers and 0 R programmers
Ways to choose 5 Python from 6: C(6, 5) = 6! / (5! * 1!) = 6
Ways to choose 0 R from 4: C(4, 0) = 1
Total ways for Scenario 3 = 6 * 1 = 6

Since these are “OR” scenarios (the team can be formed in any of these ways), we add the totals:

Total ways = 120 + 60 + 6 = 186 ways.
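
The three scenarios collapse into a single sum over the number of Python programmers on the team. A minimal sketch with illustrative constant names:

```python
import math

PYTHON_DEVS, R_DEVS, TEAM_SIZE = 6, 4, 5

# For each allowed number of Python programmers (3, 4, or 5), choose them,
# then fill the remaining seats with R programmers.
total_ways = sum(
    math.comb(PYTHON_DEVS, k) * math.comb(R_DEVS, TEAM_SIZE - k)
    for k in range(3, TEAM_SIZE + 1)
)

print(total_ways)  # 186
```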

Topic 1.2: Descriptive Statistics

We describe data using two main types of measures:

A) Measures of Central Tendency (Where is the “center” of my data?)

Mean (Average): The sum of all values divided by the number of values.
Best for: Symmetrical data (like a normal distribution) with no extreme values.
Vulnerability: Highly sensitive to outliers.
Median: The middle value when the data is sorted. If there’s an even number of values, it’s the average of the two middle ones.
Best for: Skewed data or data with significant outliers.
Robustness: It is not affected by outliers.
Mode: The value that appears most frequently.
Best for: Categorical data or discrete data. A dataset can have one mode (unimodal), two modes (bimodal), or more (multimodal).

B) Measures of Dispersion/Spread (How spread out is my data?)

Variance (σ²): The average of the squared differences from the Mean.
Why squared? To prevent positive and negative differences from canceling each other out.
Standard Deviation (σ): The square root of the variance.
Benefit: It brings the measure of spread back to the original units of the data.
Interpretation:
A low standard deviation means the data points are clustered close to the mean.
A high standard deviation means they are spread out over a wider range.

C) Measures of Shape (What is the “shape” of my data’s distribution?)

Skewness: Measures the asymmetry of the distribution.

Zero Skew: Perfectly symmetrical (like a Normal Distribution). Mean = Median = Mode.
Positive Skew (Right-skewed): The “tail” of the distribution is longer on the right. In this case, Mean > Median > Mode. Think of income distribution — most people earn an average amount, but a few billionaires pull the mean way up.
Negative Skew (Left-skewed): The “tail” is longer on the left, caused by a few unusually low values. Here, Mean < Median< Mode. Think of age at retirement — most people retire around 60–70, but a few retire very early, pulling the mean down.

Kurtosis: Measures the “tailedness” or “flattenedness” of the distribution. It tells you how much of the data is in the tails versus the center. It’s often compared to a normal distribution.

Leptokurtic (Positive Kurtosis > 0): “Thicker” or “heavier” tails and a sharper peak. This means there are more outliers than you’d expect in a normal distribution. Think of stock market returns, which have more extreme up/down days than a normal distribution would predict.
Mesokurtic (Kurtosis ≈ 3, i.e. excess kurtosis ≈ 0): Similar tailedness to a normal distribution. (Note: Some software reports “excess kurtosis”, which is kurtosis minus 3, so a normal distribution has 0 excess kurtosis. The “> 0” and “< 0” thresholds in this list refer to excess kurtosis.)
Platykurtic (Negative Kurtosis < 0): “Thinner” tails and a flatter peak. Fewer outliers than a normal distribution.

D) Outliers

An outlier is a data point that differs significantly from other observations. It can be caused by measurement error or it can be a genuine, novel data point.
Identifying them is crucial because they can skew your mean, inflate your variance, and violate assumptions of many machine learning models.
A common rule of thumb is to identify outliers as points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (where IQR is the Inter-Quartile Range, Q3 - Q1).

Question: Consider the following dataset representing the daily processing time (in minutes) for a batch job: [10, 12, 15, 15, 17, 19, 20, 21, 24, 90]. Calculate the Mean, Median, and describe the skewness of the data based on these values.

Solution:

Number of data points: n = 10
Mean = Sum / n = 243 / 10 = 24.3 minutes
Median = (5th + 6th value) / 2 = (17 + 19) / 2 = 36 / 2 = 18 minutes

Skewness: Mean (24.3) > Median (18), so the data is positively (right) skewed. The single large value, 90, pulls the mean up.
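
The statistics module reproduces these numbers and can also apply the 1.5 × IQR rule of thumb. One assumption to flag: quartile conventions differ between tools, and this sketch uses the default method of statistics.quantiles, so Q1 and Q3 may not match other software exactly.

```python
import statistics

times = [10, 12, 15, 15, 17, 19, 20, 21, 24, 90]

mean = statistics.mean(times)
median = statistics.median(times)

# Quartiles (default "exclusive" method) and the 1.5 * IQR fences.
q1, _, q3 = statistics.quantiles(times, n=4)
iqr = q3 - q1
outliers = [t for t in times if t < q1 - 1.5 * iqr or t > q3 + 1.5 * iqr]

print(mean, median)  # 24.3 18.0
print(outliers)      # [90]
```

Removing the flagged value and recomputing the mean is a quick way to see how much a single outlier moves it, while the median barely changes.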

Interviewer: “You’ve calculated the mean and median for a dataset of customer purchase values. The mean is $150 and the median is $75. What does this tell you about your customers, and which metric would you report to your business stakeholders? Why?”

Answer: The mean ($150) being significantly higher than the median ($75) tells us that the data is positively skewed: a small number of very large purchases pull the mean up.

I would report both, but I would emphasize the median as the primary metric for the ‘typical’ customer experience.

Our typical customer spends about $75 per transaction. This median value gives us the most accurate picture of the central customer. However, it’s also important to note that our average purchase is $150. This higher average is driven by a smaller group of high-spending customers. This presents us with two distinct opportunities: first, how do we cater to and grow our core $75-customer base? And second, who are these high-value customers, and how can we find and retain more of them?

Topic 1.3: Core Probability Theory

Sample Space: The set of all possible outcomes (here, all possible user actions).
Event: A subset of the sample space. A= user adds to cart. B= user makes a purchase.
Independent vs. Mutually Exclusive:
Mutually Exclusive: The two events that cannot happen at the same time. If A happens, B cannot. P(A and B) = 0.
Independent: The outcome of one event gives you no information about the outcome of another. P(A and B) = P(A)*P(B).
Joint Probability P(A and B): The probability of both events happening. What is the probability a user adds_to_cart AND makes_a_purchase?
Marginal Probability P(A): The probability of a single event, irrespective of others. What is the overall probability that a user adds_to_cart?
Conditional Probability P(A|B): The probability of event A happening given that event B has already occurred. “Given a user added_to_cart, what is now the probability they will make_a_purchase?” This P(Purchase | Add to Cart) is much higher than the general P(Purchase).

Question: An analytics company surveyed 1000 users. It found that 200 of them clicked on a promotional ad. Out of these 200, 80 made a purchase. From the 800 users who did not click the ad, 40 made a purchase. Let C be the event a user clicks the ad, and P be the event a user makes a purchase. Calculate P(P|C).

Solution: P(P|C) = P(P and C)/P(C)

P(P and C) = 80/1000 = 0.08 is the joint probability that a user both clicked AND purchased.

P(C) = 200/1000 = 0.2 is the marginal probability that a user clicked the ad.

P(P|C) = 0.08/0.2 = 0.4

There is a 40% probability that a user will make a purchase, given they have clicked the promotional ad.
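
Working from the raw survey counts, the same calculation looks like this. The sketch also computes P(P | not C) from the 800 non-clickers for comparison; variable names are illustrative.

```python
total_users = 1000
clicked = 200
clicked_and_purchased = 80
not_clicked = 800
not_clicked_and_purchased = 40

# Joint and marginal probabilities from the counts.
p_click = clicked / total_users
p_purchase_and_click = clicked_and_purchased / total_users

# Conditional probability: P(P | C) = P(P and C) / P(C)
p_purchase_given_click = p_purchase_and_click / p_click

# For comparison: purchase rate among users who did not click.
p_purchase_given_no_click = not_clicked_and_purchased / not_clicked

print(round(p_purchase_given_click, 2))     # 0.4
print(round(p_purchase_given_no_click, 2))  # 0.05
```

Clickers purchase at 40% versus 5% for non-clickers, which shows that clicking and purchasing are far from independent in this sample.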

Interviewer: Explain the difference between independent and mutually exclusive events. Can two events with non-zero probabilities be both?

Answer: Mutually exclusive events cannot occur at the same time. P(A and B)=0. Independent events are ones where the outcome of one has no bearing on the outcome of the other. P(A and B) = P(A)*P(B).

Two events with non-zero probabilities cannot be both mutually exclusive and independent. Mutual exclusivity requires P(A and B) = 0, while independence requires P(A and B) = P(A)*P(B). Since P(A) and P(B) are non-zero, their product is non-zero, which contradicts P(A and B) = 0. Therefore, they cannot be both.

Topic 1.4: The Cornerstone Theorem: Bayes Theorem

At its core, Bayes Theorem is about updating your beliefs in light of new evidence. Imagine you are working on a medical diagnostic tool.

Event A: A patient has a specific rare disease.

Event B: The patient tests positive on a new diagnostic test.

Bayes theorem helps us answer : If a patient tests positive, what is the probability they actually have the disease? i.e., what is P(Disease | Positive test)?

The Prior P(A) or P(Disease): What is our belief before we see the test result? Since the disease is rare, this probability is low. Let’s say 1 in 10,000 people have it, so P(Disease) = 0.0001. This is our initial, or “prior” belief.
The Likelihood P(B|A) or P(Positive test | Disease): How good is our test at detecting the disease when it’s actually there? This is the test’s sensitivity. Let’s say it’s a good test and correctly identifies the disease 99% of the time. So, P(Positive test | Disease) = 0.99.
The Evidence P(B) or P(Positive test): What is the overall probability of any person (sick or healthy) testing positive? A person can test positive in two ways: a) they have the disease and test positive(a true positive), OR b) they don’t have the disease but still test positive (a false positive).

Bayes theorem lets us combine these pieces to find our Posterior belief — the updated probability P(Disease | Positive test).

Question: A company screens job applicants for a specific skill using a coding test. 60% of applicants who have the skill pass the test. 10% of applicants who do not have the skill also pass the test (they guess correctly). It is known that 20% of the total applicant pool has the skill. What is the probability that an applicant who passed the test actually has the skill?

Solution:

S: The applicant has the Skill

P: The applicant Passed the test

We need to find P(S | P).

P(P | S) = 0.6 (Likelihood: probability of passing given they have the skill)

P(P | S’) = 0.1 (Probability of passing given they do not have the skill)

P(S) = 0.2 (Prior: probability that any random applicant has the skill)

P(S’) = 1-P(S) = 0.8

State Bayes’ Formula: P(S | P) = [P(P | S) * P(S)] / P(P)

To find P(P), the total probability of passing, we use the Law of Total Probability:

P(P) = P(P | S)P(S) + P(P | S’)P(S’)

P(P) = (0.6 * 0.2) + (0.1 * 0.8)

P(P) = 0.12 + 0.08

P(P) = 0.2

Substitute all values into Bayes Formula:

P(S | P) = (0.6*0.2)/0.2

P(S | P) = 0.12/0.2

P(S | P) = 0.6

The probability that an applicant who passed the test actually has the skill is 60%.
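
Bayes’ formula with the Law of Total Probability fits in a small helper. This is a sketch: the function name and parameters are illustrative, and the 1% false positive rate used for the rare-disease example is an assumption (the text gives only the prior and the sensitivity).

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' theorem, with P(E) from the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Coding test example: P(S) = 0.2, P(P | S) = 0.6, P(P | S') = 0.1
p_skill_given_pass = posterior(prior=0.2, likelihood=0.6, false_positive_rate=0.1)
print(round(p_skill_given_pass, 2))  # 0.6

# Rare disease example: prior 0.0001 and sensitivity 0.99 are from the text;
# the 1% false positive rate is an assumed value for illustration only.
p_disease_given_positive = posterior(prior=0.0001, likelihood=0.99, false_positive_rate=0.01)
print(round(p_disease_given_positive, 4))  # 0.0098
```

Under that assumed false positive rate, a positive result raises the probability of disease from 0.01% to only about 1%, because the healthy population is so much larger that false positives dominate.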

The Ultimate Probability & Statistics Study Plan (GATE DA + Interviews)

LunaVerse — Mon, 21 Jul 2025 13:16:02 GMT

This plan is structured in a bottom-up approach, starting with the fundamentals and moving towards application and inference.

Module 1: The Foundations — Describing Data & Basic Probability

This module covers the absolute basics. You can’t do anything else without mastering these concepts.

1. Counting (Combinatorics):

Permutations
Combinations
Why it’s first: These are the building blocks for calculating the size of sample spaces in classical probability.

2. Descriptive Statistics:

Measures of Central Tendency: Mean, Median, Mode (Understand the difference and when to use each, e.g., median for skewed data).
Measures of Dispersion/Spread: Variance, Standard Deviation, Range.
Why here: Before we even talk about probability, it’s essential to know how to summarize a set of data. This is the first step of EDA (Exploratory Data Analysis).

3. Core Probability Theory:

Sample Space & Events
Axioms of Probability
Independent vs. Mutually Exclusive Events (A classic interview question is to explain the difference).
Joint, Marginal, and Conditional Probability.

4. The Cornerstone Theorem: Bayes’ Theorem

Derivation and formula.
Interview Focus: Be prepared to explain it with a real-world example (e.g., medical diagnosis, spam filtering). Understand the terms: Prior, Likelihood, Posterior.

Module 2: Random Variables & Their Distributions

This is the heart of probability theory, where we model uncertainty using mathematical functions.

1. Introduction to Random Variables (RVs):

What is a Random Variable?
Discrete RVs vs. Continuous RVs.

2. Discrete Random Variables & Distributions:

Probability Mass Function (PMF): The defining function for discrete RVs.
Cumulative Distribution Function (CDF): Introduce it here! It’s a universal concept for all RVs.
Key Distributions:
Bernoulli: The single coin flip.
Binomial: The sum of n independent Bernoulli trials.
Poisson: Modeling counts of events in a fixed interval (Note: Poisson is a discrete distribution, not continuous as sometimes mistakenly grouped).
Uniform (Discrete): e.g., a single die roll.

3. Continuous Random Variables & Distributions:

Probability Density Function (PDF): The defining function for continuous RVs. Emphasize that P(X=x) = 0.
Cumulative Distribution Function (CDF): Revisit and apply to continuous cases.
Key Distributions:
Uniform (Continuous): Equal probability over a range.
Exponential: “Time until next event” in a Poisson process.
Normal (Gaussian): The most important distribution in all of statistics.
Standard Normal: The Z-distribution (μ=0, σ=1).

4. Properties of Random Variables:

Expectation (Mean) and Variance of RVs.
Conditional Expectation and Variance. (This fits more logically here than with basic descriptive stats).
Covariance and Correlation: Quantifying the relationship between two random variables. A crucial topic for feature selection in ML.

Module 3: The Bridge from Probability to Inference

These are the powerful ideas that allow us to use sample data to make claims about an entire population.

1. Foundational Theorems:

Law of Large Numbers (LLN): (Interview Addition) Why sample means converge to the true population mean.
Central Limit Theorem (CLT): The single most important theorem for inference. Understand what it says and, crucially, what its assumptions are. It’s why the Normal distribution is so ubiquitous.
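A small simulation can make both theorems concrete. The sketch below assumes an Exponential(1) population (skewed, with mean 1 and standard deviation 1); the sample sizes and the seed are arbitrary illustrative choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from Exponential(rate=1), a skewed distribution with mean 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# LLN: one sample mean lands close to the true mean (1.0) when n is large.
big_mean = sample_mean(100_000)

# CLT: the means of many size-n samples cluster around 1.0
# with a spread of roughly sigma / sqrt(n) = 1 / sqrt(n).
n = 50
means = [sample_mean(n) for _ in range(5_000)]
clt_center = statistics.fmean(means)
clt_spread = statistics.stdev(means)
theory_spread = 1 / n ** 0.5

print(big_mean, clt_center, clt_spread, theory_spread)
```

Plotting a histogram of `means` would show the bell shape emerging even though the underlying population is strongly skewed.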

2. Sampling Distributions:

The concept of a sampling distribution (the distribution of a statistic, like the sample mean).
t-Distribution: When to use it (small sample size, unknown population variance).
Chi-squared Distribution: Used for tests involving categorical data or variance.

Module 4: Statistical Inference — Making Judgements from Data

This is where everything comes together for practical application. This is a huge area for interviews.

1. The Framework of Inference:

Estimation:
Point Estimation (e.g., sample mean).
Interval Estimation: Confidence Intervals. Be able to interpret a CI correctly (e.g., “We are 95% confident that this interval contains the true population mean,” meaning that about 95% of intervals built this way from repeated samples would contain it).
Hypothesis Testing:
The core logic: Null Hypothesis (H₀) vs. Alternative Hypothesis (H₁).
Type I & Type II Errors: (Crucial Interview Topic) False positives and false negatives.
p-value: The most famous (and misunderstood) concept. Be able to explain it to a non-technical person.
Significance Level (alpha, α).

2. Specific Statistical Tests:

z-Test: (for means, when population variance is known or sample is large).
t-Test: (one-sample, two-sample independent, paired). Interview Focus: Know the assumptions for each test!
Chi-squared Test: (Goodness of fit, Test for independence).

Module 5: Interview-Focused & Advanced Topics

These topics solidify your understanding and directly map to common data science job responsibilities.

Maximum Likelihood Estimation (MLE): (Crucial Addition)

What it is: The principle behind finding the parameters for many ML models (like Logistic Regression). Understand the intuition: “What parameters for my chosen distribution make my observed data most probable?”
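To make the intuition concrete, here is a minimal sketch for the simplest case, a Bernoulli model. The data are hypothetical, and the brute-force grid search is used only for illustration; in practice the maximum is found analytically or with a numerical optimizer.

```python
import math

def bernoulli_log_likelihood(p, data):
    """Log-likelihood of 0/1 observations under Bernoulli(p)."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical: 7 successes in 10 trials

# Try every candidate parameter on a grid in (0, 1) and keep the one
# that makes the observed data most probable.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))

print(p_hat)                  # grid maximizer
print(sum(data) / len(data))  # closed-form MLE: the sample proportion
```

The grid maximizer and the sample proportion agree, which is the closed-form result for the Bernoulli case.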

A/B Testing: (The #1 Applied Stats Interview Topic)

This is simply a practical application of a two-sample hypothesis test.
Be ready to discuss the entire process: choosing a metric, setting up the experiment, determining sample size, and interpreting results.
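As a sketch of the "interpreting results" step, the snippet below runs a two-proportion z-test, one common choice when the metric is a conversion rate. The visitor and conversion counts are hypothetical, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for H0: both variants share the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under H0 the best estimate of the common rate pools both groups.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: control A converts 200/2000, variant B converts 250/2000.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The p-value is then compared with the significance level α chosen before the experiment started.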

2. Correlation vs. Causation:

A fundamental conceptual check. Always be prepared to discuss this.

3. (Bonus) Resampling Methods:

Bootstrapping: A powerful simulation-based method to estimate confidence intervals or the standard error of a statistic, especially when the theoretical distribution is unknown. Very practical.
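A minimal percentile-bootstrap sketch, assuming a small hypothetical sample and the median as the statistic of interest. The number of resamples and the seed are arbitrary choices.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    # Resample with replacement, recompute the statistic each time, then sort.
    stats = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

data = [12, 15, 9, 20, 14, 11, 30, 13, 16, 10, 18, 12]  # hypothetical sample
low, high = bootstrap_ci(data)
print(f"95% bootstrap CI for the median: [{low}, {high}]")
```

The same function works for any statistic that takes a list, for example `statistics.fmean` or `statistics.stdev`.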

Artificial Neural Network

LunaVerse — Sun, 24 Oct 2021 09:32:53 GMT

Introduction

One of the most influential technologies of the past decade is the artificial neural network. It is the fundamental building block of the deep learning algorithms behind applications we see every day, such as virtual assistants (for example, Amazon’s AI-powered Alexa), image recognition (Face ID for unlocking a phone), diagnosing health issues, and fraud detection. There are many things computers can do better than humans, but our incredible brains are still a step ahead. That brings us to a simplified definition: artificial neural networks are computational models, inspired by the biological neural networks that constitute the brain, that allow machines to learn tasks by considering examples.

Why Artificial Neural Networks?

Conventional computers follow an algorithmic approach: the system follows a set of instructions in order to solve a particular problem. Unless the series of steps the computer needs to follow is known, the computer cannot solve the problem. So first we need a person who knows how to solve the problem to spell out specific steps for the computer to follow. This restricts the problem-solving capability of conventional computers to problems that we already understand and know how to solve.

What if we have no clue how to solve the problem? Then the traditional approach fails.

To address this, artificial neural networks were introduced under the umbrella of artificial intelligence.

What are artificial neural networks (ANNs)?

ANNs are computational models that analyze and process a problem using operations similar to those of neurons in the human brain.

By loosely mimicking the human brain, the computer is able to learn, predict, and make decisions from data. ANNs gather knowledge by detecting patterns and relationships in the data and learning from experience, without task-specific programming. In general, the more data the model is trained on, the more accurate it becomes.

The architecture of an artificial neural network

ANNs are nothing more than a network of artificial neurons. Here is a simple network.

Structure of Artificial Neural Network model

A neural network can have many layers, and the complexity of the model increases as the number of layers increases.

The simplest form of neural network has three layers:

Input layer
Hidden layer
Output layer

A single artificial neuron is called a perceptron. When many perceptrons are connected into a network with one or more hidden layers, it is called a multilayer perceptron.

First, the input layer receives the input signals and passes them to the hidden layer for processing. Each connection between these two layers carries a weight, which is initially assigned at random. Each hidden neuron computes the weighted sum of its inputs, adds a bias, and passes the result through the activation function. In its simplest form, the activation function acts as a threshold: neurons whose weighted sum exceeds that value fire. The function therefore decides which nodes fire, and finally the output is calculated. An artificial neural network is thus a composition of many perceptrons, connected in different ways and operating with different activation functions. This whole process is known as Forward Propagation.
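Forward propagation through a tiny network can be sketched in plain Python. The weights, biases, and layer sizes below are made-up illustrative values, and a sigmoid is assumed as the activation function (a smooth alternative to a hard threshold).

```python
import math

def sigmoid(z):
    """Smooth activation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(inputs, weights, biases):
    """One layer: each neuron computes sigmoid(weighted sum of inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

x = [0.5, -1.0]                       # input layer: 2 features
W_hidden = [[0.1, 0.4], [-0.3, 0.2]]  # 2 hidden neurons, illustrative weights
b_hidden = [0.0, 0.1]
W_out = [[0.7, -0.5]]                 # 1 output neuron
b_out = [0.05]

hidden = dense_forward(x, W_hidden, b_hidden)  # input layer -> hidden layer
output = dense_forward(hidden, W_out, b_out)   # hidden layer -> output layer
print(hidden, output)
```

Each call to `dense_forward` is one layer of the network, so deeper networks are simply more calls chained together.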

For a multilayer perceptron, the common algorithm for supervised training is known as Backward Propagation (backpropagation). Comparing the model’s output with the expected output gives the error. To reduce this error, the weights are updated by propagating it backward through the network, and this process is repeated for a certain number of epochs. Once the weights have been updated, the model is used for prediction.
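The training loop can be sketched for a very small network with one hidden layer. This is an illustrative sketch only: it assumes sigmoid activations, a squared-error loss, plain gradient descent, and the logical AND function as a toy dataset. None of these choices come from the text above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, hidden_size=2, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network; return the mean loss per epoch."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Weights start at random values; biases start at zero.
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden_size)]
    b1 = [0.0] * hidden_size
    w2 = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            # Forward propagation.
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(w1, b1)]
            out = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
            total += 0.5 * (out - y) ** 2
            # Backward propagation: chain rule from the output back to the hidden layer.
            d_out = (out - y) * out * (1 - out)
            d_h = [d_out * w2[j] * h[j] * (1 - h[j]) for j in range(hidden_size)]
            # Gradient descent: move each weight against its gradient.
            for j in range(hidden_size):
                w2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * d_h[j] * x[i]
            b2 -= lr * d_out
        losses.append(total / len(data))
    return losses

# Toy dataset: logical AND.
and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
losses = train(and_gate)
print(f"first epoch loss: {losses[0]:.4f}, last epoch loss: {losses[-1]:.4f}")
```

The error shrinks over the epochs because every update nudges the weights in the direction that reduces the loss on the current example.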

Training ANN model

To train an ANN model, it needs to be fed a huge amount of data as the training set. For example, if you want the system to recognize an image of a dog, you have to provide thousands of images as inputs in order to train it.

Artificial neural networks assign values to the weights of the connections between neurons. To reach high accuracy, the model has to adjust these weights to the right values. In the example above, as more training examples are provided, the neural network propagates the error back and adjusts its weights to map each input to the correct output. In the process, each layer learns to detect specific features in the images. Finally, the adjusted weights allow the network to identify the image accurately. After training, the network may produce output even with incomplete information, which is what makes it appear to work intelligently.

Advantages of ANN

Storing information: Information is stored across the entire network, not in a database. The loss of a few pieces of information does not prevent the network from functioning.
Fault tolerance: The corruption of one or more cells of an ANN does not prevent it from generating output. This feature makes the networks fault-tolerant.
Gradual corruption: A damaged network degrades gradually over time rather than failing all at once.
Parallel processing capability: Artificial neural networks have the numerical strength to perform more than one job at the same time.

Limitations of ANN

Amount of data: Neural networks need thousands or even millions of examples.
Limited generalization: A neural network can only do the task it is trained for. Even on a similar problem, it will not perform well.
Black box: Although neural networks can make human-like decisions, it is very hard to find the factors (logic) that determined a decision.

Applications of neural networks:

Image recognition
Fraud detection
Forecasting
Stock market prediction
Healthcare
Handwriting analysis
Credit rating
Social media
Defense

Conclusion

This article gives a simplified overview of artificial neural networks.