My First Data Set
The start of my first project
Hello all. Are you having a good day? Do you have something that makes you giddy? I do. Like I mentioned before, I’ve grabbed a few data sets and decided to use one for my first project! I got the idea from the Data Science Learning Club community. Check it out if you’re interested.
In this project we are exploring a data set of our choosing. I wanted to use something related to one of my interests, so I looked for a data set related to video games. Luckily, I managed to find one of IGN game reviews that someone on Reddit was kind enough to scrape from the website and share with the community. In today’s post I’ll talk about the first few steps I took for this activity.
At the end of the last post on matrices, I alluded to a bigger, badder data structure. It’s called a data frame. A data frame is a collection of vectors of equal length. Unlike a matrix or a single vector, a data frame can hold columns of different types, which makes it useful for data analysis. More often than not, the data set you’ll be working with will be stored in a data frame because of how convenient it is.
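To make that a bit more concrete, here is a minimal sketch of a data frame built by hand. The names (toy_df, toy_matrix) and all the values are made up for illustration; they are not from the IGN data set.

```r
# Toy data frame: three columns of different types, all of length 3
# (names and values are invented for this example)
toy_df <- data.frame(
  Game  = c("Game A", "Game B", "Game C"),  # character vector
  Score = c(7.8, 9.0, 8.7),                 # numeric vector
  Year  = c(2012L, 2013L, 2014L),           # integer vector
  stringsAsFactors = FALSE                  # keep text as character, not factor
)

# Each column keeps its own type
sapply(toy_df, class)

# A matrix, by contrast, forces everything into one type.
# Mixing text and numbers turns the numbers into text.
toy_matrix <- cbind(c("Game A", "Game B"), c(7.8, 9.0))
class(toy_matrix[1, 2])
```

The matrix at the end is only there for contrast: once text and numbers share a matrix, the scores are stored as text and you can no longer do math on them directly.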
The first thing you should do before any analysis starts is load in the data set. Data sets come in many different file types, and thankfully R has a slew of packages that make the process easier. My IGN data set is stored in a Microsoft Excel file, so I’ll be using the R package XLConnect to load it in.
library(XLConnect) ## This loads in the package so I can use it
games_df <- readWorksheetFromFile("gamedata.xlsx", sheet = 1)
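Excel is only one of those file types. If the same kind of data were saved as a CSV file instead, base R could read it without any extra package. Here is a rough sketch of that, using a throwaway temporary file and made-up rows (csv_path and csv_df are names I picked for the example, not part of my project):

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Game,Platform,Score,Genre",
  "Game A,PC,7.8,Shooter",
  "Game B,Wii U,9.0,Racing"
), csv_path)

# read.csv comes with base R, so no library() call is needed.
# The first line of the file is used as the column names.
csv_df <- read.csv(csv_path, stringsAsFactors = FALSE)
csv_df
```

Either way you end up with a data frame, so everything that follows works the same no matter which file type you started from.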
I named the data frame “games_df”. I used XLConnect’s readWorksheetFromFile function to load the data from the first Excel sheet of gamedata.xlsx. The next step is to get a general idea of the structure of the data. The easiest way to do this is by using the dim function:
dim(games_df)
[1] 17534     4
The data set has 17,534 observations (rows) and 4 columns. That means there are 17,534 reviews in this file! Holy crap that’s a ton! Ahem. To preview the first few rows of this set we can use the head function:
head(games_df)
  Game                       Platform      Score Genre
1 Wolfenstein: The New Order Xbox One      7.8   Shooter
2 Mario Kart 8               Wii U         9.0   Racing, Action
3 Sportsfriends              PlayStation 3 8.7   Action, Compilation
4 Sportsfriends              PlayStation 4 8.7   Action, Compilation
5 Sportsfriends              PC            8.7   Action, Compilation
6 Super TIME Force           Xbox One      7.5   Shooter
I think this makes things easier to digest. You can see that every row has 4 columns: Game, Platform, Score, and Genre. The elements within the data set are either characters, numerics, or integers. If you want to preview the last few rows instead, you can use the tail function. Haha.
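If you want to try these functions without downloading anything, here is a small sketch on a made-up data frame. toy_df and its values are invented for the example; the functions themselves (dim, head, tail, str, sapply, class) all come with base R.

```r
# Made-up data frame with 8 rows so head() and tail() show different rows
toy_df <- data.frame(
  Game     = paste("Game", LETTERS[1:8]),
  Platform = rep(c("PC", "Wii U"), times = 4),
  Score    = c(7.8, 9.0, 8.7, 8.7, 8.7, 7.5, 6.0, 9.5),
  stringsAsFactors = FALSE
)

dim(toy_df)            # number of rows, then number of columns
head(toy_df)           # first 6 rows by default
tail(toy_df, n = 3)    # last 3 rows
str(toy_df)            # column types plus a peek at the values
sapply(toy_df, class)  # just the type of each column
```

str is the one I would reach for to check which columns are characters and which are numbers, since it shows the type of every column in one go.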
That’s it for now. This is only the first part of this project and I don’t want to overwhelm you all. In the next post we’ll look for observations and patterns, possible hypotheses, and visualization of the data! Bar graphs, scatterplots, and histograms oh my!
If you learned something new and/or enjoyed this post hit the like button.