A Beginner’s Guide to Data Analysis

Alex S
Published in CodeX · 13 min read · Jul 22, 2021

If you’re unfamiliar with the domain of data analytics, whether as a new member of the field or merely an outside observer, then there’s a chance that your knowledge of data analysis doesn’t extend much further than your high school or college statistics class. And if you’re under the impression that the rudimentary concepts from that class, which you kind-of-but-not-really learned as you gawked at your crush from across the room, are probably not very important at the professional level, then you would be sorely mistaken.

Even professionally, some of the most basic ideas in statistics — averages, medians, standard deviations, etc. — are often some of the first values that analysts seek to define when beginning analysis on a new dataset. In this article I’m going to give a walkthrough in layman’s terms of a basic data analysis project on SAT and ACT performance data, whose process includes significantly more than just a statistical analysis. Let’s dive in!

*Please note that this article is not a guide to the Python programming language. Although I reference many blocks of code that were used throughout my project to achieve certain objectives, the focus of this article is on the data science/analytics process itself, which can be performed using a variety of programming languages.

Data Collection/Import

The first step in data analysis is simple: collect your data. For more advanced projects, in order to know which data may be useful, you might need to begin by identifying the problem you’re attempting to solve. From there you might decide to search for existing datasets online or collect data manually from various sources. However, for this project I used pre-crafted datasets (link — https://blog.collegevine.com/here-are-the-average-sat-scores-by-state/) that were provided to me in a classroom environment. These .csv files include SAT and ACT performance data from 2017 and 2018, some of which can be seen here:

Once imported into my notebook, the first thing I did was take an overarching glance at the datasets to familiarize myself with their contents. The SAT datasets for each year include each state’s participation level, average scores for Reading and Math, and the average Total scores. The ACT datasets for each year include each state’s participation level, average scores for English, Math, Reading, and Science, and finally their Composite scores. Curiously, I noticed a wide range of participation levels, ranging from the single digits up to 100%. We’ll revisit that later in the article.
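For reference, here’s a minimal sketch of what that import step can look like with pandas. The file names below are assumptions for illustration, not the actual paths from the project:

import pandas as pd

# Hypothetical file names; the actual .csv paths from the project may differ
sat_2017 = pd.read_csv('sat_2017.csv')
act_2017 = pd.read_csv('act_2017.csv')
sat_2018 = pd.read_csv('sat_2018.csv')
act_2018 = pd.read_csv('act_2018.csv')

# A quick first glance at one of the datasets
print(sat_2017.head())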

Data Cleaning/Organization

The second step in the process, sometimes called munging or scrubbing, involves cleaning and organizing the data in order to prepare it for analysis. If you’re new to this, you might wonder what that means. As an example, let’s take a look at the snippet of data below:

As you can see, the average Math score in Maryland on the 2017 SAT is allegedly a 52 out of 800. Now you don’t need to be an expert in public education policy or statistics to be skeptical about that number. Even Lenny from Of Mice and Men could have scored higher than that. As I mentioned previously, this dataset was provided to me in a classroom environment, where it was intentionally modified with errors and typos for my peers and me to identify and correct. To check this potential error, I compared it against the original dataset online and found that the average score was actually a 524, not 52. In order to correct this mistake in my personal dataframe, which I named “sat_2017”, I ran the following line of code:

sat_2017.loc[20, 'Math'] = 524

The process of cleaning data can involve a whole variety of other tasks besides just correcting typos. In fact, munging is often the bane of a data analyst’s existence. It’s usually the most time-consuming and meticulous step of the entire process, especially when you’re working with large datasets that contain hundreds of thousands or even millions of rows. For that reason, I can’t walk through every possible scenario concerning data cleaning, but I will share a couple more tasks that I had to perform while munging this data, starting with modifying column types.

Most columns in a dataframe can be described in layman’s terms as being either alphabetical or numerical (We use some more nuanced terms like objects, floats, integers, and others to describe certain variations of these categories, but it’s essentially just letters vs. numbers). If you look back at the previous image a couple paragraphs up, you will notice that the Participation column lists a percentage symbol, %, next to the numbers, whereas the other columns do not. This poses a problem in the coding world. With the exception of a decimal or occasionally a few other symbols, performing a mathematical computation on data that contains even a single letter or non-numerical character will generate an error. To resolve this issue, data must be reformatted in order to remove those non-numerical characters so that you can perform a computation on the column.
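If you want to check how pandas has classified each column before you start computing anything, a quick sketch looks like this:

# Each column's current type: Participation shows up as "object" because of the
# '%' symbols, while the score columns are already numeric
print(sat_2017.dtypes)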

I removed the percentage symbols from the entire Participation column using the following line of code:

sat_2017['Participation'] = sat_2017['Participation'].map(lambda x: str(x).replace('%', ''))

My next step was to recategorize the Participation column as a “float,” which is a type of numerical column that allows for decimals. You might be wondering why this was necessary, considering that I had already removed the percentage symbols from the Participation column. Well, when a column’s data contains letters and other non-numerical symbols, that column is considered an “object.” It’s important to note that merely removing non-numerical symbols from an “object” column does not automatically reclassify the column; it will still be recognized as an “object” until a programmer converts it to another type. This matters because mathematical computations, by default, cannot be performed on “object” columns, even if those “objects” are composed entirely of numbers. In other words, we must reclassify the column as numerical before we can perform computations on it. Ergo, with the percentage symbols removed, I was then able to convert the Participation column from an “object” to a “float” using the following line of code:

sat_2017['Participation'] = sat_2017['Participation'].astype(float)

This is what the dataframe looked like after removing the percentage symbols and then recategorizing the “object” column as a “float” column.

Note that attempting to convert the column type into a “float” without first removing the percentage symbols would have generated an error. I had to make the column consist of nothing but numerical values before I could recategorize it as a “float” column. (Interestingly, you can always go the other way around and convert a “float” column of numbers into an “object” column of text, if you so desire!)
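As a quick illustration of both points, here’s a toy example on a made-up Series rather than the project data:

import pandas as pd

s = pd.Series(['76%', '100%', '5%'])

# s.astype(float) would raise a ValueError here, because '%' is not numeric.

# Strip the symbol first, then convert:
s_clean = s.str.replace('%', '', regex=False).astype(float)

# Going the other direction is always allowed: numbers happily become text.
s_text = s_clean.astype(str)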

The second-to-last part of the cleaning process was to rename the columns. This was not for personal preference or aesthetics. The rationale was that each column from all four datasets (2017 SAT, 2017 ACT, 2018 SAT, 2018 ACT) would need a unique name, because the final step would be to merge all four of these into a single dataframe. Because column names need to be unique within a dataframe, and to avoid ambiguity, I had to rename almost every column accordingly (I left the State column as is, because that’s the column on which I merged the dataframes). I used a simple pattern to rename each column with its year, exam, and subject using the following code:

sat_2017 = sat_2017.rename({'State': 'state', 'Participation': "'17 sat_partic", 'Language': "'17 sat_language", 'Math': "'17 sat_math", 'Total': "'17 sat_total"}, axis='columns')

act_2017 = act_2017.rename({'State': 'state', 'Participation': "'17_act_partic", 'English': "'17_act_english", 'Math': "'17_act_math", 'Reading': "'17_act_reading", 'Science': "'17_act_science", 'Composite': "'17_act_composite"}, axis='columns')

After renaming the 2017 columns, I merged those two dataframes together as follows:

combined_17 = pd.merge(sat_2017, act_2017, on='state')

This is what a portion of the merged dataframe for 2017 testing data looked like at this point:

Everything I have described about data cleaning and organizing up until this point has been for the two 2017 datasets (SAT and ACT). The whole process had to be meticulously repeated for both of the 2018 datasets — identifying/correcting typos, reformatting data, recategorizing columns, renaming columns, and more. By this point I’m sure you can understand why I mentioned earlier that munging is the bane of an analyst’s existence!
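One way to make that repetition less painful (a pattern worth knowing, though not shown step by step in this walkthrough) is to wrap the shared cleaning steps in a small helper function. Here’s a rough sketch; the column names and usage are illustrative, not taken from the original project:

def clean_exam_data(df, rename_map):
    """Strip '%' from the Participation column, convert it to a float, and rename columns."""
    df = df.copy()
    df['Participation'] = (
        df['Participation']
        .astype(str)
        .str.replace('%', '', regex=False)
        .astype(float)
    )
    return df.rename(rename_map, axis='columns')

# Hypothetical usage for the 2018 SAT data:
# sat_2018 = clean_exam_data(sat_2018, {'State': 'state', 'Participation': "'18_sat_partic"})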

After cleaning and organizing both of the 2018 datasets, the very final step was to merge the combined 2017 data with the combined 2018 data into a single dataframe, which I performed with the following code:

combined_all = pd.merge(combined_17, combined_18, on='state')

Exploratory Data Analysis (EDA) — Part 1

The third step of the data analysis/science process is often called Exploratory Data Analysis, or EDA for short, and it’s exactly what it sounds like. Now that the data has been cleaned and organized, this is the stage where an analyst will finally begin to explore it and search for trends, patterns, and even anomalies. Oftentimes, especially in a business environment, an analyst might be handed a raw set of data and simply told to make sense of it, and that’s what this stage (and the next) is really about.

While making sense of an unfamiliar dataset could involve asking yourself dozens of questions until you find the “right” question(s) to base your analysis on, this particular academic project was not so convoluted. Considering the data was relatively small anyway, the most I realistically expected to identify was which states had the highest and lowest participation rates for each exam in each year, and whether any category showed a noticeable rise or decline in performance from 2017 to 2018.

One surface-level analysis I almost always perform on any dataset involves identifying some of the most basic statistical values. These include the mean (average), maximum, minimum, median, and standard deviation for each numerical column (and sometimes the mode). In the case of this relatively simple project, this list includes most of what I was seeking to identify anyway, as I mentioned in the previous paragraph. I ran the following lines of Python code to determine these values and display them in an organized fashion:

print("Means:")
print()
print(combined_all.mean())
print()
print("Medians:")
print()
print(combined_all.median())
print()
print("Maximums:")
print()
print(combined_all.max())
print()
print("Minimums:")
print()
print(combined_all.min())
print()
print("Standard Deviations:")
print()
print(combined_all.std())
print()

The code’s output is quite lengthy, so let’s just use the Means section as an example. As you can see in the following image, my notebook displayed the average values for all of the numerical columns in my dataframe.

Notice that the State column is not included. This is because the State column is an “object,” and mathematical computations will only target numerical columns, including “floats” and “integers” among others. Hopefully now you can understand more intuitively why it was necessary to recategorize the Participation column during the munging stage of the project, as different data types require different treatments.
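As a side note, pandas also offers a more compact way to pull most of these summary statistics at once. This isn’t how I displayed them above, but it’s a standard alternative worth knowing:

# count, mean, std, min, quartiles, and max for every numeric column in one table
print(combined_all.describe())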

According to this data, the average SAT participation nationwide was less than 41% of eligible students in 2017 and less than 47% in 2018. The average ACT participation was about 64% of eligible students in 2017 and less than 61% in 2018. The first thing that came to my mind, especially as a former classroom teacher, is that not all high schoolers apply to college. Additionally, most college applicants take either the SAT or the ACT, but not both. Considering these important bits of information, it stands to reason that we shouldn’t expect these numbers to be anywhere near 100%. Depending on my goal for a given project, I’ll dive into the data and examine it from different perspectives.

Hypothetically, let’s say I were contracted to do research for a nationwide company that creates study materials for college entrance exams, specifically in math. In that situation, it would be wise for me to take a closer look at general math performance on both exams and how it varies from state to state. In fact, I would probably be looking at much more data than just exam performance. Perhaps I would notice a correlation between states with lower math performances and states where the company has not invested as many resources marketing itself. If my client’s goal is simply to maximize profit, then my recommendation to them might be to market themselves more in those particular states, while continuing their current practices in the states that are performing well. For more advanced projects like that one, it’s crucial to think about what your goals or your client’s objectives are in order to determine how to best explore the data and arrive at the most informed conclusions.

Finally, there is one more specific component of my EDA that I want to share in this article. My previous calculations were performed simply by identifying aspects of the existing data, but sometimes analysis requires creating new data to explore. Consider that the datasets used in this project contain performances from two different time periods. Under that circumstance, it would probably be wise to explore the change in value over time for each category. Let’s use SAT Reading performance as an example. If I want to know the net change in performance, I would subtract each state’s average SAT Reading score in 2017 from its score in 2018. Using just the following line of code, I created a new column in my dataframe to indicate this net change in SAT Reading performance:

combined_all['SAT_reading_net_change'] = combined_all["'18_sat_language"] - combined_all['Evidence-Based Reading and Writing']

With the new column in the table, you can clearly see that the net change in Alaska’s SAT Reading performance from 2017 to 2018 was 15 points, whereas Arkansas’ performance dropped by 22 points — yikes! When dealing with data that occurs over time, exploring the net change in value for every category can be very informative and provide you with the insight you need to draw valuable conclusions.
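If you wanted to rank every state by that net change rather than eyeballing the table, a short sketch (using the column names created above) might look like this:

# Sort states by their change in average SAT Reading score from 2017 to 2018
biggest_moves = combined_all[['state', 'SAT_reading_net_change']].sort_values(
    'SAT_reading_net_change', ascending=False
)

print(biggest_moves.head())   # the largest improvements
print(biggest_moves.tail())   # the steepest declines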

EDA Part 2 — Visualization

The EDA stage doesn’t usually require as much time as Data Cleaning, although it can definitely drag on longer than the lifespan of that cockroach in your garage that refuses to die. Beyond the surface-level exploration and analysis you can perform in the first part of this stage, creating visualizations can often make it much easier to discover insights. Using simple graphs like histograms, bar graphs, line charts, and many others, you can gain a much better understanding of which data is important and why. Let’s take a look at some of the visualizations I used in this project.

Average 2018 SAT Math scores by state

A simple bar graph is a great tool for comparing performance values across all states, and the bar graph above displays the average SAT Math scores in 2018. Despite how crowded this graph is, it’s visually easy to spot the difference between high-performing states and low-performing states.
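For anyone curious how a chart like this can be produced, here’s a minimal sketch using matplotlib. The “'18_sat_math” column name follows the renaming pattern from earlier and is an assumption, so adjust it to match your own dataframe:

import matplotlib.pyplot as plt

plt.figure(figsize=(14, 5))
plt.bar(combined_all['state'], combined_all["'18_sat_math"])   # assumed column name
plt.xticks(rotation=90)
plt.ylabel('Average SAT Math score (2018)')
plt.title('Average 2018 SAT Math scores by state')
plt.tight_layout()
plt.show()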

Besides the fact that this much data can be too much for a single bar graph, sometimes other graphs are simply better suited to convey certain information. Take a look at this next histogram, which incorporates the exact same data as the previous bar graph. It’s much more useful if you’re interested in knowing how many states performed within different ranges of scores, rather than comparing individual states to each other. Histograms are excellent for breaking numerical data into groups and measuring the frequency of data points within each group.
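The same data, grouped into score ranges instead of plotted state by state, might be sketched like this (again assuming the “'18_sat_math” column name):

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.hist(combined_all["'18_sat_math"], bins=10, edgecolor='black')   # assumed column name
plt.xlabel('Average SAT Math score (2018)')
plt.ylabel('Number of states')
plt.title('Distribution of average 2018 SAT Math scores')
plt.show()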

Histograms can also be used to compare multiple sets of data to each other. Take a look at the following histogram as an example.

As you can see above, I compared 2017 math performances between the SAT and the ACT. This involved an extra step before I could create the histogram, because the SAT’s math section and the ACT’s math section are scored on different scales (out of 800 and 36, respectively). Consequently, in order to graph both sets of scores on the same histogram, I first had to convert them into percentages. Once all is said and done, the data seems to suggest that the ACT’s math section might be more challenging than its SAT counterpart. Only two states had an ACT math average above 70%, whereas seventeen states had SAT math averages above the same threshold. Exactly one state had an SAT math average below 60%, but a whopping twenty-nine states averaged below 60% on the ACT math. These are the types of insights that become significantly easier to spot once you’ve used data visualization tools appropriately.
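Here’s a rough sketch of how that comparison could be built, dividing each exam’s averages by its maximum possible math score. The column names are assumptions based on the renaming pattern shown earlier:

import matplotlib.pyplot as plt

# Convert each state's average to a percentage of the exam's maximum math score
sat_math_pct = combined_all["'17 sat_math"] / 800 * 100   # assumed column name
act_math_pct = combined_all["'17_act_math"] / 36 * 100    # assumed column name

plt.figure(figsize=(8, 5))
plt.hist(sat_math_pct, bins=10, alpha=0.5, label='SAT Math (2017)')
plt.hist(act_math_pct, bins=10, alpha=0.5, label='ACT Math (2017)')
plt.xlabel('Average math score (% of maximum)')
plt.ylabel('Number of states')
plt.legend()
plt.title('2017 math performance: SAT vs. ACT')
plt.show()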

It’s worth noting that, if this project involved exam data for several years instead of just two years, then a line graph would be very useful to display changes over time. However, in the case of this project, viewing the net change of performance from 2017 to 2018 as a single, new column worked just fine.

Another useful visualization (which most people definitely don’t learn about in introductory statistics courses) is called a heat map. There are two kinds of heat maps: cluster heat maps and spatial heat maps. In this domain, cluster heat maps are much more common. As you can see, I used a heat map to visualize the correlations between each performance category and every other performance category. The darker the shade of blue, the higher the correlation.
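A correlation heat map like this can be sketched with seaborn, assuming it’s installed; the exact styling of my chart may differ:

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between every numeric (performance) column
correlations = combined_all.select_dtypes('number').corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlations, cmap='Blues')
plt.title('Correlations between performance categories')
plt.tight_layout()
plt.show()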

Interpreting Data/Recommendations

To be continued…
