duo, trēs

The train is now rolling full steam ahead; there’s little room to pause and ponder the scenery, or to run back and snap something you missed. We’re now sweeping over the land of models and algorithms. But first, a quick recap of the past two weeks:

Week 2: Probability and Statistics

P(A|B) = P(B|A)*P(A) / P(B)

Turns out there’s a whole lot more to Bayes’ work than that equation above. Sure, we might use the formula to uncover conditional probabilities that are difficult to detect by empirical observation. But more interestingly, we can leverage the Bayesian strategy of incorporating existing information to minimize ‘regret’ in the multi-armed bandit problem, or use it as a more robust, more interpretable alternative to p-values in conventional A/B testing. A key difference between Frequentist and Bayesian statistics is how the parameters of interest are interpreted. The Frequentist assumes these parameters are fixed but unknown quantities, whereas a Bayesian treats them as random variables. It might seem like a subtle difference in wording, but it’s an important one. Perhaps most intriguing to me is that a Bayesian makes inferences conditional on the data observed and (typically) does not consider hypothetical data sets that were not observed.
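As a quick numerical illustration of the formula, here’s what Bayes’ rule looks like for a hypothetical diagnostic test. All of the numbers below are made up purely for demonstration:

```python
# Hypothetical numbers: a condition affects 1% of a population, and a test
# detects it 95% of the time with a 5% false positive rate.
p_condition = 0.01            # P(A): prior probability of having the condition
p_pos_given_condition = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05    # P(B|not A): false positive rate

# P(B): total probability of testing positive
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_healthy * (1 - p_condition))

# P(A|B): probability of actually having the condition given a positive test
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos
print(f"P(condition | positive test) = {p_condition_given_pos:.3f}")  # ~0.161
```

Even with a fairly accurate test, a positive result only lifts the probability to roughly 16% — exactly the kind of conditional probability that’s hard to eyeball from raw observation.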

In a sense, Bayesians get rid of the inconvenience of setting up hypothesis frameworks and solving for extra parameters, at the cost of having to specify a prior distribution. An example of how the Bayesian framework can be applied is the idea of conjugate priors. If the prior, the P(A) part of the equation, follows a beta distribution and the likelihood, the P(B|A) part, is Bernoulli/Binomial, then the posterior, P(A|B), will also follow a beta distribution. This pairing is what makes the prior ‘conjugate’, and it is really powerful when predicting an event given a set of observations (e.g. you might want to predict whether a coin is biased based on the number of heads/tails in its previous 10/100/1000 flips). Another advantage of the Bayesian framework is that you’re allowed to ‘peek’ at your data while your tests are running, and even adjust your features as you go. This would be highly dangerous in a Frequentist hypothesis test, where the p-value carries meaning only if the sample size/duration and goals were determined ahead of time.
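To make the conjugate-prior idea concrete, here’s a minimal sketch of the Beta-Binomial update for the coin example. The prior parameters and flip counts are invented, and I’m leaning on scipy purely for convenience:

```python
from scipy import stats

# Beta(1, 1) prior == uniform: no strong belief about the coin's bias.
alpha_prior, beta_prior = 1, 1

# Hypothetical data: 100 flips, 62 heads.
heads, flips = 62, 100
tails = flips - heads

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior,
# updated simply by adding the observed counts.
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails
posterior = stats.beta(alpha_post, beta_post)

print(f"Posterior mean P(heads) ~= {posterior.mean():.3f}")
# Probability the coin is biased towards heads, i.e. P(heads) > 0.5:
print(f"P(bias > 0.5 | data) ~= {1 - posterior.cdf(0.5):.3f}")
```

The update itself is just addition: the posterior Beta parameters are the prior parameters plus the observed head and tail counts, which is exactly what makes conjugacy so convenient.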

(It wasn’t my intention to write exclusively about Bayesian statistics here — it just sort of happened… I think that’s pretty indicative of the type of rabbit hole-esque investigations you can find yourself in within the world of probability and statistics.)

Week 3: Linear & Logistic Regression

One of the misconceptions about linear regression (and I was definitely guilty of being in that camp) is that the idea is to fit a straight line through a set of data points. y = mx + b, we were told back in the day. Yes, but that’s only the single-input special case. y = Xβ is a much better description of what’s actually going on.

Understanding the equation above requires us to relax our assumptions about what each term actually is. y and X are matrices, not individual numerical values. y is the n by 1 array that stores the dependent variable (the output), while X is the n by (j + 1) array that stores the independent variables (the input), plus a leading column of ones. n is the number of data points you have, while j is the number of input variables/features. Finally, β is the array that contains all the coefficients for our regressors in X. One might ask, what happened to the intercept term b from our initial (misleading) equation? It’s simply the first element of the β matrix, β0, the coefficient that multiplies that column of ones.
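Here’s a minimal sketch of that matrix form with made-up data, recovering β via ordinary least squares. Note the leading column of ones in X that produces the intercept β0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 50 points, j = 2 features.
n, j = 50, 2
features = rng.normal(size=(n, j))
y = 3.0 + 1.5 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(scale=0.1, size=n)

# X is n by (j + 1): a leading column of ones for the intercept, then the features.
X = np.column_stack([np.ones(n), features])

# Solve y = X @ beta in the least-squares sense.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ~ [3.0, 1.5, -2.0] -> [intercept beta0, beta1, beta2]
```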

So how is our initial claim about linear regression being a straight line wrong? The ‘linear’ part refers to the model being linear in the parameters β0, β1, β2, …, βj. The equation y = β0 + β1X + β2X^2 gives us a quadratic relationship between X and y, but the model is linear in its parameters, so it is still linear regression. In this case we get a U-shaped curve, and we can extend the same idea to a model with any number of polynomial terms. The result is a curve, not a straight line, which debunks our initial statement. If you can model your predictors to form a curve with 20 bends that perfectly describes your data, why wouldn’t you just do that every single time? That opens up a whole new conversation about bias and variance, which I’ll save for a later post.
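Going back to the quadratic example, here’s a sketch (again with invented data) showing that y = β0 + β1X + β2X^2 is fit with exactly the same least-squares machinery, since X^2 is just another column in the design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical U-shaped data: y = 2 - 3x + 0.5x^2 plus noise.
x = rng.uniform(-5, 5, size=100)
y = 2 - 3 * x + 0.5 * x**2 + rng.normal(scale=0.5, size=100)

# Design matrix with columns [1, x, x^2]: the relationship between x and y is
# quadratic, but the model is still linear in beta0, beta1, beta2.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ~ [2.0, -3.0, 0.5]
```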

Closing Thoughts

I’m starting to see the world via this matrix paradigm, and while I’d be foolish to think that everything in this world can be distilled into rows and columns, I think it really helps put things into perspective. Whether you’re forecasting a farmer’s crop yield to help manage food waste or uncovering behavioral patterns of suicide victims to curb suicide rates, it makes so much sense (to me at least) to visualize the world in matrices. Final note to self: be a good Bayesian and always update your beliefs.