Your First Ten Minutes of Analyzing Data with Pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
It provides a relatively simple and quite hassle-free way to get started with almost any kind of dataset analysis, be it a simple dataset for completing a college assignment or a massive dataset from a Kaggle competition.
In this post, we will look at the bare bones of how to use this library in the first ten minutes of your analysis.
To follow along, you’ll need to have Jupyter notebook installed. (If you’re still skeptical about using Jupyter notebooks, remember that in a worldwide survey of data analysts, Jupyter notebook was rated number three among their most useful tools at work.) The [official documentation](http://jupyter.readthedocs.io/en/latest/install.html) does a pretty good job of walking you through the setup. Next, you’ll need a dataset to analyze. Here, we will use the dataset provided by Kaggle for [House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

Once you’ve downloaded the dataset, you’ll notice a `train.csv` file. This is the training set for the analysis: the dataset with which we get started on our preliminary analyses.
Once you have your Jupyter notebook running, you’ll want to import the pandas library. Data analysts usually shorten the name pandas to `pd`, just to save some typing. To import the library, type in the following line of code:

```python
import pandas as pd
```
The very first thing to examine is a brief summary of the dataset. pandas provides a handy `describe()` method to accomplish this. It outputs a concise summary for each numeric column in the dataset: the count of non-null values, mean, standard deviation, minimum, quartiles, and maximum. Its usefulness lies in its ability to shed light on clear outliers or columns with many missing values.
Assuming the file `train.csv` is in the same directory as your Jupyter notebook, you can type in the following lines of code:

```python
data = pd.read_csv("train.csv")
print(data.describe())
```
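To see how `describe()` surfaces missing values, here is a minimal sketch on a tiny made-up frame (the column names merely echo the Kaggle dataset; the numbers are invented):

```python
import pandas as pd
import numpy as np

# A tiny synthetic frame (not the Kaggle data) to illustrate describe()
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],  # one missing value
})

summary = df.describe()
print(summary)

# The "count" row reports non-null entries per column, so a column
# with missing data shows a smaller count than the frame's length.
print(summary.loc["count", "LotFrontage"])  # 3.0
```

A column whose count is well below the number of rows is an immediate flag that you’ll need to handle missing data later.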
Always remember that the `read_csv` function returns a data structure called a DataFrame.
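You can verify this yourself. The following sketch writes a made-up miniature CSV (the file name `tiny.csv` and its contents are invented for illustration) and checks the type of what `read_csv` hands back:

```python
import pandas as pd

# Write a hypothetical two-row CSV so read_csv has a file to parse
with open("tiny.csv", "w") as f:
    f.write("Id,SalePrice\n1,208500\n2,181500\n")

data = pd.read_csv("tiny.csv")

# read_csv returns a pandas DataFrame
print(type(data))  # <class 'pandas.core.frame.DataFrame'>
```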
The next basic step is to analyze the columns in the dataset. The names of the columns can be accessed through the `columns` attribute of a pandas DataFrame:

```python
# This prints the keys for each of the columns
data_columns = data.columns
print(data_columns)
```
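Note that `columns` returns a pandas Index object, not a plain Python list. A minimal sketch (using a made-up frame whose column names stand in for the Kaggle dataset’s keys):

```python
import pandas as pd

# Small invented frame for illustration
data = pd.DataFrame({"SalePrice": [208500], "LotArea": [8450], "Street": ["Pave"]})

data_columns = data.columns
print(type(data_columns))            # a pandas Index, not a list

# Convert to a list when you need ordinary list behaviour
print(list(data_columns))

# Membership checks work directly on the Index, too
print("SalePrice" in data_columns)   # True
```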
Finally, you’ll also want to examine a particular column or a set of columns, based on the output you obtain from calling the `describe()` method. If you’ve been following along in your Jupyter notebook, you should have seen a listing of all the columns in the dataset after running the code above.
Now, let’s say you only want to examine a single column, say the price at which each house sold. You can access a particular column using dot notation, like so:

```python
price_data = data.SalePrice
print(price_data.head())
```
This returns a pandas Series. You can think of a Series as essentially a DataFrame with only one column.
Note that the part after the dot above (i.e. “SalePrice”) must exactly match the key of that particular column in the dataset. The `head()` method outputs only the first few entries, which is useful if you’re just interested in seeing what the values in a particular field look like.
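Bracket notation is an equivalent way to pull out a column, and it is the only option when a key is not a valid Python attribute name. A minimal sketch (the frame and the space-containing column name `"Lot Area"` are invented for illustration):

```python
import pandas as pd

# Made-up frame; "Lot Area" deliberately contains a space
data = pd.DataFrame({"SalePrice": [208500, 181500], "Lot Area": [8450, 9600]})

# Bracket notation is equivalent to dot notation for valid identifiers...
print(data["SalePrice"].head())

# ...and it is the only way to reach a key that isn't a valid attribute name
print(data["Lot Area"].head())
```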
You can also examine multiple columns at once. Just remember to pass the keys of the columns as strings in a list:

```python
cols_to_examine = ['SalePrice', 'LotArea', 'Street']
interested_cols_data = data[cols_to_examine]
```

You can verify that this indeed returns only the columns we asked for by calling the `describe()` method:

```python
interested_cols_data.describe()
```
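One caveat to be aware of: on a frame with mixed types, `describe()` summarizes only the numeric columns by default, so a string-typed column such as Street would be silently left out. Passing `include="all"` brings non-numeric columns into the summary. A minimal sketch on a synthetic stand-in for the three columns above:

```python
import pandas as pd

# Invented stand-in for the SalePrice / LotArea / Street columns
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500],
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Default describe() keeps only the numeric columns
print(df.describe().columns.tolist())                # ['SalePrice', 'LotArea']

# include="all" adds object columns (count / unique / top / freq rows)
print(df.describe(include="all").columns.tolist())   # all three columns
```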
So, that’s how you go about a very preliminary investigation of a dataset using pandas. If you have any questions, do leave them in the comments section and I’d be glad to help. Happy coding!