Getting to Know the Pandas DataFrame Object

This blog post covers the Python Pandas DataFrame object. Code examples show ways to create one, subset data, explore data and plot it using the matplotlib package.

What is Pandas?

pandas is a package for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It works particularly well with Jupyter Notebooks, where you can also use bash commands, magic commands, plotting capabilities and take advantage of a nice overall presentation of code, visuals and comments.

What is a DataFrame object?

Pandas offers two primary data structures: Series and the DataFrame objects. Whereas a Series represents a one-dimensional labeled indexed array based on the NumPy ndarray, a DataFrame object treats tabular (and multi-dimensional) data as a labeled, indexed series of observations. You can compare a DataFrame with a spreadsheet Excel or a relational database table. If you use R, this will look very familiar as R also uses Data Frames. You can use DataFrames for organizing data or exploratory data analysis.

Creating a DataFrame Object

The following code loads the pandas package, reads a csv file, applies a tab as a separator and prints the DataFrame object inside an IDE. If you´re running the same code in a Jupyter Notebook, you´ll notice that the cells have a neat layout with borders, lacking in an IDE. The JN also prints the total number of rows and columns underneath the DataFrame.

>>Import pandas
>>df = Pandas.read_csv(r”c:\data\myfile.csv”, sep=’\t’)
>>df.shape # prints the amount of rows and column numbers
>>df.columns # prints column names of dataset
>>df.dtypes #lists data types of all columns

Subsetting Data in a DataFrame

If DataFrames are too large to work with, or if you´re only interested in a subset of the data, Pandas offers a number of ways to subset your data:

>>column_values_df = df[‘name_of_column’] # a way to subset one column. This command prints all values for one column
>>subset = df[[‘column#1’, ‘column#2’, ‘column#3’]] # Subsetting multiple columns at once. Note the double brackets.
>>df_subset = df[list(range(1, 3))] # subsets only columns 2 and 3, using the range function that goes along with the list function in Python 3. Column names are zero-based, so here the first column is skipped here.
>>df.loc[0] # subsets the first row. If there are no row numbers, pandas will automatically number them for you in a DF, starting with 0.
>>df.shape[0] # returns the total row amount
>>df.loc[df.shape[0] -1] # returns the latest row in the DataFrame. Loc uses row name.
>>df.iloc[8] # returns row number
>>df.ix[[0, 10, 100]] # subset multiple rows, by row name or integer numbers
>>df.ix[8, ‘year’] # subset one particular cell value in a row

Data Exploration with a DataFrame

After creating a DataFrame and subsetting the data, you might want to explore the data to have and do some basic plotting, concatenate different data frames, rename column names or write the new DataFrame to a .csv file.

>>grouped = df.groupby([‘year’, ‘country’])[‘pop’].mean() # calculates the mean population of a country by year.
>>import matplotlib.pyplot as plt # imports the matplotlib package for plotting
>>%matplotlib inline # magic command in Jupyter Notebook to enable use of plotting inside of a cell.
>>df.groupby(‘year’)[‘pop’].mean().plot() # perform some simple plotting
>>import pandas as pd
>>df1 = pd.read_csv(r”C:\data\myfile_1.csv”)
>>df2 = pd.read_csv(r”C:\data\myfile_2.csv”)
>>df3 = pd.read_csv(r”C:\data\myfile_3.csv”)
>>row_concat = pd.concat([df1, df2, df3]) #concatenates all three data frames by row
>>row_concat.shape # prints (x,y), meaning x rows, y columns
>>col_concat = pd.concat([df1, df2, df3], axis=1) # concatenate by column, axis=1 means columns
>>concatenated.to_csv(‘concatenated.csv’) #saves the new DataFrame to a .csv file
>>ls #this is a bash command to check if file has been made. It prints the current working directory and file structure inside a Jupyter Notebook cell.

Pandas website: