My First Data Set

The start of my first project

Hello all. Are you having a good day? Do you have something that makes you giddy? I do. Like I mentioned before, I’ve grabbed a few data sets and decided to use one for my first project! I got the idea from the community Data Science Learning Club. Check it out if you’re interested.

In this project we are exploring a data set of our choosing. I wanted to use something related to one of my interests so I looked for a data set related to video games. I luckily managed to find one on IGN game reviews that someone on Reddit was kind enough to scrape from the website and open to the community. In today’s post I’ll talk about the first few steps I took for this activity.

Data Frame

At the end of the last post on matrices, I alluded to a bigger, badder data structure. These are called data frames. A data frame is a collection of vectors of equal length. Unlike a matrix or a single vector, it can hold columns of different types, which makes it useful for data analysis. More often than not, the data set you’ll be working with will be stored in a data frame because of how convenient it is.
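To make that concrete, here is a minimal sketch of a data frame built by hand. The names `toy_df`, `game`, `score`, and `n_genres` and all the values are made up for illustration; they are not from the IGN file. The last two lines show what happens if you try to squeeze the same vectors into a matrix.

```r
# Toy data, made up for illustration (not taken from the IGN file)
game     <- c("Game A", "Game B", "Game C")   # character vector
score    <- c(7.8, 9.0, 8.7)                  # numeric vector
n_genres <- c(1L, 2L, 2L)                     # integer vector

# Each vector becomes a column; all three must have the same length
toy_df <- data.frame(game, score, n_genres, stringsAsFactors = FALSE)

# Each column keeps its own type
col_classes <- sapply(toy_df, class)
print(col_classes)

# A matrix built from the same vectors coerces everything to character
toy_matrix <- cbind(game, score, n_genres)
print(class(toy_matrix[1, "score"]))
```

The data frame keeps the scores as numbers, while the matrix turns them into text, which is why data frames are the better fit for mixed data.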

The first thing to do before any analysis is to load the data set. Data can come in many different file types, and thankfully R has a slew of packages that make the process easier. My IGN data set is stored in a Microsoft Excel file, so I’ll be using the R package XLConnect to load it in.

`library(XLConnect)  ## This loads in the package so I can use it`
`games_df <- readWorksheetFromFile("gamedata.xlsx", sheet = 1)`
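A quick aside on other file types: if your data comes as a CSV instead of an Excel file, base R can read it without any extra package. This is only a sketch with made-up data; `toy_df`, `csv_path`, and `loaded_df` are hypothetical names, and the file is written to a temporary location just so the example runs on its own.

```r
# Toy data, made up for illustration
toy_df <- data.frame(
  Game  = c("Game A", "Game B"),
  Score = c(7.8, 9.0),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_df, csv_path, row.names = FALSE)

# read.csv ships with base R, so no extra package is needed
loaded_df <- read.csv(csv_path, stringsAsFactors = FALSE)
print(loaded_df)

unlink(csv_path)  # clean up the temporary file
```

Now back to the Excel file.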

I named the data frame “games_df”. I used the XLConnect function readWorksheetFromFile to load the data from the first Excel sheet of gamedata.xlsx. The next step is to get a general idea of the structure of the data. The easiest way to do this is by using the dim function:

`dim(games_df)`
`[1] 17534   4`

The data set has 17,534 observations (rows) and 4 columns. That means there are 17,534 reviews in this file, one per game and platform! Holy crap that’s a ton! Ahem. To preview the first few rows within this set we can use the head function:

`head(games_df)`
`                        Game      Platform Score               Genre`
`1 Wolfenstein: The New Order      Xbox One   7.8             Shooter`
`2               Mario Kart 8         Wii U   9.0      Racing, Action`
`3              Sportsfriends PlayStation 3   8.7 Action, Compilation`
`4              Sportsfriends PlayStation 4   8.7 Action, Compilation`
`5              Sportsfriends            PC   8.7 Action, Compilation`
`6           Super TIME Force      Xbox One   7.5             Shooter`

I think this makes things easier to digest. You can see that all games fall under 4 columns: Game, Platform, Score, and Genre. The elements within the data set are either characters, numerics, or integers. If you want to preview the last few rows you can use the tail function. Haha.
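If you want to try these functions without the IGN file, here is a small sketch on a made-up data frame. The name `toy_df` and its values are assumptions for illustration only. It also uses `str` and `sapply(..., class)`, two base R functions for checking the type of each column, which I did not use above.

```r
# Toy data, made up for illustration (10 rows, 2 columns)
toy_df <- data.frame(
  id    = 1:10,
  score = seq(5, 9.5, by = 0.5),
  stringsAsFactors = FALSE
)

print(dim(toy_df))   # number of rows, then number of columns
print(nrow(toy_df))  # just the row count
print(ncol(toy_df))  # just the column count

# head and tail show 6 rows by default; n changes that
first_rows <- head(toy_df, n = 3)
last_rows  <- tail(toy_df, n = 3)
print(first_rows)
print(last_rows)

str(toy_df)                   # compact summary of columns and their types
print(sapply(toy_df, class))  # the class of each column
```

The `n` argument is handy when six rows is too many or too few to get a feel for the data.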

That’s it for now. This is only the first part of this project and I don’t want to overwhelm you all. In the next post we’ll look for observations and patterns, possible hypotheses, and visualization of the data! Bar graphs, scatterplots, and histograms oh my!

If you learned something new and/or enjoyed this post hit the like button.

Like what you read? Give Kerry Benjamin a round of applause.
