# Machine Learning for Data Analysis Week 3

## Context

In this project we are trying to establish links between personality traits and **Grit**. **Grit** is the tendency to sustain interest in and effort toward very long-term goals. Research has established the predictive power of **Grit**, over and above measures of talent, for objectively measured success outcomes. For instance, in prospective longitudinal studies, **Grit** predicts surviving the arduous first summer of training at West Point, reaching the final rounds of the National Spelling Bee, retention in the U.S. Special Forces, retention and performance among novice teachers and sales agents, and graduation from Chicago public high schools.

The general consensus in academic psychology is that an individual's personality derives from five fundamental traits: **Extroversion**, **Conscientiousness**, **Neuroticism**, **Agreeableness**, and **Openness**. You can read more on the traits and the data used here. These traits, along with some socio-demographic variables, are used in the analysis below.

#### Lasso regression

— Python code here —

A lasso regression analysis was conducted to identify a subset of variables from a pool of 47 binary categorical and quantitative predictor variables that best predicted a quantitative response variable measuring **High Grit**.

**Preparing the data**

Because a linear regression cannot use a categorical predictor directly, each categorical predictor has been converted to as many binary variables as it has categories. For instance, the married categorical variable (*What is your marital status?*), which can take three values, has been decomposed into three binary predictors (often called dummy variables):

- Never married (0 False, 1 True)
- Currently married (0 False, 1 True)
- Previously married (0 False, 1 True)
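As an illustration of this decomposition, here is a minimal sketch using pandas. The column name `married` and the category labels are assumptions made for the example, not the names used in the actual dataset.

```python
import pandas as pd

# Hypothetical toy data: the column name and category labels are assumptions.
df = pd.DataFrame({
    "married": ["never", "currently", "previously", "currently"],
})

# Build one 0/1 column per category, mirroring the three binary predictors above.
dummies = pd.get_dummies(df["married"], prefix="married", dtype=int)
print(dummies)
```

Each row has exactly one of the three columns set to 1, so the original category can always be recovered from the dummy variables.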

Quantitative predictor variables include age, number of siblings from the same mother and the five fundamental personality traits.

All predictor variables were standardized to have a mean of zero and a standard deviation of one.
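Standardization matters for lasso because the penalty is applied to coefficient sizes, which depend on each predictor's scale. A minimal sketch of the computation with NumPy follows; the ages are made-up values. scikit-learn's `preprocessing.scale` performs the same calculation.

```python
import numpy as np

# Hypothetical ages, for illustration only.
age = np.array([23.0, 35.0, 41.0, 29.0, 52.0])

# Subtract the mean and divide by the (population) standard deviation.
age_std = (age - age.mean()) / age.std()

print(age_std.mean(), age_std.std())
```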

Data were randomly split into a training set that included 70% of the observations (N=2630) and a test set that included 30% of the observations (N=1127). The least angle regression algorithm with 10-fold cross-validation (k=10) was used to estimate the lasso regression model on the training set, and the model was validated on the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
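The procedure described above can be sketched with scikit-learn's `train_test_split` and `LassoLarsCV`. This is an illustration on synthetic data, not the code used for this analysis: the data, the random seeds, and the sample size are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 10 standardized-like predictors,
# of which only the first three actually influence the response.
rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.standard_normal(n)

# 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Lasso fitted by least angle regression, penalty chosen by 10-fold cross-validation.
model = LassoLarsCV(cv=10).fit(X_train, y_train)

# Predictors whose coefficients were not shrunk to exactly zero are "retained".
retained = np.flatnonzero(model.coef_)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("retained predictor indices:", retained)
print("R-squared (train, test):", train_r2, test_r2)
```

The fitted model also exposes `mse_path_`, the cross-validated mean squared error along the regularization path, which is the quantity used to pick the penalty.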

**Results**

Of the 47 predictor variables, 17 were retained in the selected model. During the estimation process, **Conscientiousness** and **Neuroticism** were most strongly associated with **High Grit**, followed by Black ethnicity and **Extroversion**. Having at most a high school education was negatively associated with **High Grit**. Other predictors associated with **High Grit** are indicated in the table below. In green are the variables positively associated with **High Grit**, and in red those negatively associated with the response variable. Together, the retained variables accounted for 28.9% of the variance in the **High Grit** response variable.
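Because the predictors are standardized, the size of each coefficient can be compared across variables. A minimal sketch of how retained predictors might be ranked and split by sign is shown here; the predictor names and coefficient values are placeholders, not results from this analysis.

```python
# Placeholder coefficients (assumption): NOT the values from this analysis.
coefs = {
    "predictor_a": 0.42,
    "predictor_b": -0.31,
    "predictor_c": 0.0,   # shrunk to zero by the lasso, so not retained
    "predictor_d": 0.08,
    "predictor_e": -0.02,
}

# Keep only predictors with a non-zero coefficient.
retained = {name: c for name, c in coefs.items() if c != 0}

# Rank by absolute coefficient size, strongest association first.
ranked = sorted(retained.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Split by sign: positive and negative associations with the response.
positive = [name for name, c in ranked if c > 0]
negative = [name for name, c in ranked if c < 0]

print(ranked)
```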