Looking back at the first week, it was certainly a time for adjustments. It has been great getting into the learning mindset, and I have certainly been asking all the questions that spring to mind!
Coursework for the week was mostly a review of Python, some descriptive statistics, data visualization, Git, and Tableau. That all came together for the first project, which looked at SAT scores across the US states.
The data was presented as the averages for each state in both the Verbal and Math sections of the exam, along with the participation rate of eligible students.
The first step was just to play around with the data: using graphs to observe the distributions, then coming up with an appropriate way to present the data and draw conclusions.
It’s tricky, since not enough statistics were applied to confidently infer anything. Still, creating the graphs themselves is incredibly handy for working out which questions to ask.
I started by plotting a histogram in seaborn for each of the variables in the data. The example for Math scores is below:
It was a great way to start, but it only really showed one piece of the data. I wanted to know if there would be a way for me to present multiple graphs to be able to look for relationships across all three variables. This started with me creating a 3D scatterplot:
It was interesting, but not too intuitive to read. To make the relationships easier to see, I made a grid of pairwise scatterplots and distributions instead. It was complemented by a kernel density estimation graph.
From the data, it would appear that a higher rate of participation goes along with both a lower Verbal and a lower Math score on the SAT. The graphs alone can’t show that one causes the other, but it would seem that there is some slight survivorship effect going on with the data: people taking the exam in areas of lower participation are probably going in more prepared on average than those from areas of higher participation.
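A quick way to put a number on a relationship like this is a Pearson correlation coefficient. The sketch below uses only the standard library; the function name `pearson_r` is just an illustrative helper, and the participation rates and scores are made-up placeholders rather than the project data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance, up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only; NOT the real SAT data.
participation = [5, 10, 50, 70, 80]      # percent of eligible students
math_scores = [600, 580, 520, 500, 495]  # average Math score

r = pearson_r(participation, math_scores)
print(f"r = {r:.2f}")  # a negative r means scores fall as participation rises
```

A strongly negative value would support the pattern seen in the graphs, though a correlation on its own still says nothing about cause.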
One note about the kernel density estimate: it typically works better with larger, finer-grained datasets. It was still an interesting way to present the data, and if it had been a collection of scores from every school in the country, it would have been much more precise (and interesting!).
As a final part of my project there was an introduction to Tableau, which led to one of the most exciting troubleshooting events: fixing Nebraska’s disappearance.
When the data was imported, the states were stored as their abbreviations. This caused a minor problem in Tableau, since the abbreviation for Nebraska in the data was NB (the standard postal abbreviation is NE), which is also the province code for New Brunswick. The fix just involved changing the alias for Nebraska.
It was a well-documented problem, and it will be very useful to know about when dealing with names shared across a number of places.
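The fix described above was made inside Tableau by editing the alias. An alternative, sketched here as an assumption rather than what was actually done, is to clean the codes in Python before the data ever reaches Tableau. The names `CODE_FIXES` and `normalize_state_codes` are hypothetical helpers for illustration.

```python
# Hypothetical lookup of known-bad codes and their standard USPS replacements.
# "NB" is the code for New Brunswick; Nebraska's postal code is "NE".
CODE_FIXES = {"NB": "NE"}


def normalize_state_codes(codes):
    """Return the codes upper-cased, trimmed, and with known bad codes replaced."""
    cleaned = []
    for code in codes:
        code = code.strip().upper()
        # Fall back to the cleaned code itself when no fix is listed.
        cleaned.append(CODE_FIXES.get(code, code))
    return cleaned


print(normalize_state_codes(["NB", "ca ", "NE"]))
```

Keeping the corrections in one dictionary makes it easy to add more entries if other ambiguous codes turn up.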
It was great to start looking at visualization and get my feet wet in the world of data science.