The Crux of Data Science
People hire data scientists to make predictions about the future and thus to inform courses of action that benefit their firms. The fundamental task is creating a “model” that fits a dataset and explains it. Building a model is an intricate process: choosing which variables we believe contribute most to the projected output, comparing the model’s predictions against what we actually observe, adjusting the model’s parameters, and more. But the world constantly shifts, and no single model, as much as people might wish otherwise, is going to hold true across all of its variations.
But we can consistently adjust our models, and add new ones, to account for these variations. Constant attentiveness to and calibration of our models will, if our intuitions and insights are on point, result in predictions that bear a closer resemblance to real-world observations. This makes the job of a data scientist more multi-layered and important than simply using tools to build a model.
Making these adjustments when necessary is at the crux of what it means to be a data scientist. Through a combination of mathematical skill, coding skill, and intuition about the real world, we have to identify these factors and track how they constantly change.
For example, when we went through a dataset of Titanic passengers, including their gender, class, age, whether they had children, where they embarked, and whether they survived, our goal was to determine which factors contributed most to a passenger’s likelihood of surviving. For this task we used “feature selection” methods from Scikit-Learn, a Python toolbox for data analysis and data mining, to determine which of these features had the largest effect on whether or not a passenger survived. Some of these methods add or remove features one at a time from an equation that estimates the likelihood of survival: starting with all of the features and removing the weakest (a top-down method), starting with one and adding more (a bottom-up method), or trying subsets of them at random. I first set my output, “y”, to the “Survived” column, set my input, “X”, to the other features I deemed relevant to survival, and applied different feature selection methods to arrive at a set of variables.
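The X/y setup described above can be sketched in a few lines. The rows below are illustrative stand-ins in the shape of the standard Titanic dataset (the real file has hundreds of passengers, and “Sex” would first need to be encoded as numbers, as it is here):

```python
import pandas as pd

# Hypothetical rows in the shape of the Titanic dataset, for illustration only.
# Column names follow the commonly used Kaggle version of the data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Sex":      [0, 1, 1, 0, 1, 1],   # already encoded: 0 = male, 1 = female
    "Age":      [22, 38, 26, 35, 27, 54],
    "Parch":    [0, 0, 1, 0, 2, 0],   # parents/children aboard
    "Fare":     [7.25, 71.28, 13.0, 8.05, 21.08, 51.86],
})

# y is the output we want to predict; X is every remaining candidate predictor.
y = df["Survived"]
X = df.drop(columns=["Survived"])
```

With X and y separated this way, any of Scikit-Learn’s feature selection methods can be fit on the pair.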
One example of these is “SelectKBest,” a method that scores each feature individually against the output and keeps the “k” (a number you assign) highest-scoring features for your model. Running this on the Titanic dataset and asking for five features, I found the best predictors of survival were passenger class, gender, age, whether they had children, and the fare they paid. Intuitively, these all make sense: first-class passengers would be most likely to survive, women would be prioritized, older passengers would be less likely to survive, children would be prioritized, and higher-paying customers would tend to be in first class (noting that fare may therefore be collinear with passenger class, which should be investigated later).
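A minimal sketch of that SelectKBest run, again on hypothetical rows rather than the real dataset (and asking for three features instead of five, since the toy frame only has five columns); the ANOVA F-test scoring function shown here is one of several scorers Scikit-Learn offers:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical stand-in for the Titanic data, for illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Sex":      [0, 1, 1, 0, 1, 1],
    "Age":      [22, 38, 26, 35, 27, 54],
    "Parch":    [0, 0, 1, 0, 2, 0],
    "Fare":     [7.25, 71.28, 13.0, 8.05, 21.08, 51.86],
})
y = df["Survived"]
X = df.drop(columns=["Survived"])

# Score each feature against y and keep the k highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X, y)
chosen = X.columns[selector.get_support()].tolist()
print(chosen)
```

On the real dataset with k=5, this kind of run is what surfaced class, gender, age, children, and fare as the strongest predictors.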
Each day in class we learn more about the various ways these factors or features can affect our output, and about the corresponding ways we can adjust our model to take them into account. Whether by scaling our variables, increasing or decreasing their influence on the output, or by some other means, we can continually adjust our models to reflect the real world. This, as I have learned through intense theory and practice, is the crux of Data Science.
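The scaling adjustment mentioned above can be sketched with Scikit-Learn’s StandardScaler. The two columns below are hypothetical age and fare values; the point is that fare spans a much wider range than age, and standardizing puts each column at mean 0 and standard deviation 1 so no feature dominates just because its units are large:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [age, fare] rows; fare's scale is much larger than age's.
X = np.array([[22.0,  7.25],
              [38.0, 71.28],
              [26.0, 13.00],
              [35.0,  8.05]])

# Rescale each column to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # each column's mean is now ~0
```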