Photo by Pascal Bernardon on Unsplash

Make the Most Out of your pandas.read_csv()

Melissa Rodriguez · Published in Analytics Vidhya · Dec 15, 2019 · 7 min read


When I started learning to code in Python and to use the pandas library, my main focus was getting the data imported so I could start working with it. I did not pay attention to all the parameters available in the pandas.read_csv() or pandas.read_excel() functions. Then, after importing, I would have to drop and/or rename columns, reset indexes, and change data types, among other things, just because I was not aware of the parameters that can save you time and, let’s be honest… it just looks better!

Steps like dropping and renaming columns, setting indexes, changing data types, and applying functions to columns are some of the cool things you can do while importing the data.

Let’s take a look:

Here is the csv file and the code I first tried for importing the fertility rate data used in my previous blogs:

csv file to be imported

For my analysis I want to use all columns except the ones named Indicator Name and Indicator Code. The column for the year 2018 is also empty, so I do not need it either.

#import pandas library
import pandas as pd
#import fertility rate data
df = pd.read_csv('data/API_SP.DYN.TFRT.IN_DS2_en_csv_v2_41035.csv', skiprows=4)
#remove unnecessary columns
df = df.drop(columns=['Indicator Name','Indicator Code','Unnamed: 63','2018'])
#rename columns
df.rename(columns={'Country Name':'CountryName',
                   'Country Code':'CountryCode'},
          inplace=True)
df.head()

As you can see, the following steps were done in the code above:

  1. imported the data
  2. dropped columns
  3. renamed columns

Now let’s see an updated version of the code with the same results:

#import pandas library
import pandas as pd
from fr_data_clean import cols_to_use
#create a list of the column names to import
names = ['CountryName','CountryCode']
names.extend([str(year) for year in range(1960, 2018)])
#import fertility rate data
df = pd.read_csv('data/API_SP.DYN.TFRT.IN_DS2_en_csv_v2_41035.csv',
                 header=2,
                 usecols=cols_to_use(63, [2, 3, 62]),
                 names=names)

Now, for this second version of the code, I took advantage of some of the parameters available for pandas.read_csv().
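The cols_to_use helper imported from fr_data_clean is a small helper of my own and is not shown in this post. Judging from the call cols_to_use(63, [2,3,62]), it builds the list of column positions to keep by leaving out the excluded indexes; a minimal sketch of what such a helper could look like:

#hypothetical sketch of cols_to_use (not the actual fr_data_clean code)
def cols_to_use(n_cols, exclude):
    #keep every column position from 0 to n_cols - 1, except the excluded ones
    return [i for i in range(n_cols) if i not in exclude]
#cols_to_use(63, [2, 3, 62]) -> [0, 1, 4, 5, ..., 61]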

header & names

header: It allows you to set which row of your file will be used as the column names for your dataframe, as well as the starting point of the data.

  • It expects an int or a list of int values
  • The default value is 'infer', which behaves like header=0: the first row of data is assumed to contain the column names.
  • If your file does not have a header, set header=None.

names: Used to set the names for the dataframe columns.

  • If the file contains no header row, header must be set to None
  • It expects a list of strings, one name per column.
  • Can be used to replace the column names while importing the data when used in combination with header parameter.

Examples:

Using header parameter:

csv file where column names are in the second row

#importing data without setting the header parameter
df = pd.read_csv('data.csv')

dataframe with incorrect header

#setting header parameter to the second row
df = pd.read_csv('data.csv', header=1)

dataframe imported with the correct header setting

Using names parameter:

File without header names

#assigning column names for data with no headers
df = pd.read_csv('inventory.csv', header=None,
                 names=['model','size','width','color','qty'])

dataframe with assigned column names during import

Renaming columns when importing the file:

Let’s say that you would like to have different names for your columns; you can make that change while importing the data:

Data to import
#changing column names while importing data
df = pd.read_csv('all.csv', header=0,
                 names=['Country','Code2','Code3','Code','Region'])

***It is important to pass the row that holds the existing header (header=0 here); otherwise the header row in the file will be read in as a row of data.

usecols

When importing a file, you can use this parameter to only import those columns that you care about.

  • It expects string or list of strings if using column names or it can be an integer or list of integers if using column index position.
  • Make sure you pass the exact names or positions, otherwise pandas will raise an error.

Examples:

For the file shown in the image above, these two versions of the code will import the same columns:

#using column positions
df = pd.read_csv('diabetes.tab.txt', sep='\t', usecols=[0,1,2,3])
#using column names
df = pd.read_csv('diabetes.tab.txt', sep='\t',
                 usecols=['AGE','SEX','BMI','BP'])

Other parameters:

index_col

This parameter allows you to set which column will be used as an index.

  • Default value is None.
  • It expects string or list of strings if using column names or it can be an integer or list of integers if using column index position.
  • If index_col=False, pandas is forced not to use the first column as the index, which helps with malformed files that have a delimiter at the end of every line (see the short sketch after the example below).

Example:

For this example I have a file that I created by exporting a pandas dataframe to a csv file. This data includes an index column:

Now, if I import the data without specifying the index column, I get an “Unnamed: 0” column that I will need to drop later.

df = pd.read_csv('FR_2017.csv')
df.head()

Instead, I will use index_col parameter to avoid this extra column and extra work!

df = pd.read_csv('FR_2017.csv',
                 index_col=0)
df.head()
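The index_col=False case mentioned above is easiest to see with a small, made-up example: a malformed file where every data row ends with an extra comma. A quick sketch using an in-memory string:

import pandas as pd
from io import StringIO

#hypothetical malformed data: each data row ends with a trailing comma
raw = 'model,size,qty\nA100,10,5,\nB200,12,3,\n'

#by default pandas uses the first column as the index,
#because the data rows have one more field than the header
print(pd.read_csv(StringIO(raw)))

#index_col=False keeps 'model' as a regular column instead
print(pd.read_csv(StringIO(raw), index_col=False))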

nrows

Number of rows to read from the file. Useful to get a preview of a large file.

  • It expects an int value

Example:

Get the first 20 records of the diabetes.tab.txt file:

df = pd.read_csv('diabetes.tab.txt', sep='\t', nrows=20)
df.info()

converters

The converters parameter allows you to pass a dictionary of functions to transform the values of specified columns. Isn’t this amazing?

Example:

Let’s look at one of the files of the UFC Fight Historical dataset:

This is a snip of the raw_fighter_details.csv file.

Looking at it, I would like to remove the ‘lbs.’ from the Weight column and the ‘ " ’ from the Reach column. I will also only bring in the first four columns.

#creating functions to clean the columns
w = lambda x: (x.replace('lbs.',''))
r = lambda x: (x.replace('"',''))
#using converters to apply the functions to the columns
fighter = pd.read_csv('raw_fighter_details.csv',
                      converters={'Weight':w, 'Reach':r},
                      header=0,
                      usecols=[0,1,2,3])
fighter.head(15)
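One thing to keep in mind: since the lambdas above return strings, the cleaned Weight and Reach columns still come in as object dtype. If you need them as numbers, a quick follow-up conversion works (a small sketch reusing the fighter dataframe from above):

#convert the cleaned string columns to numeric; blank values become NaN
fighter['Weight'] = pd.to_numeric(fighter['Weight'], errors='coerce')
fighter['Reach'] = pd.to_numeric(fighter['Reach'], errors='coerce')
fighter.info()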

parse_dates

Used to parse columns as datetime type.

  • It can be a Boolean value, a list of ints for column positions, or a list of strings for column names
  • Default value is False
  • If a column or index cannot be represented as an array of datetimes the column or index will be returned unaltered as an object data type.

Example:

Importing data without using parse_dates:

fighter = pd.read_csv('raw_fighter_details.csv',
                      converters={'Weight':w, 'Reach':r},
                      header=0,
                      usecols=[0,1,2,3,5])
fighter.info()

Applying parse_dates to DOB column:

fighter = pd.read_csv('raw_fighter_details.csv',
                      converters={'Weight':w, 'Reach':r},
                      header=0,
                      usecols=[0,1,2,3,5],
                      parse_dates=['DOB'])
fighter.info()

dtype

Allows you to define column data types while importing the data.

  • Expects a type name if applying the same type to all columns or a dictionary with column name as the key and the type as the value.
  • If converters are specified, they will be applied INSTEAD of dtype.

Examples:

Assigning object type to all columns:

df = pd.read_csv('inventory.csv', header=None,
                 names=['model','size','width','color','qty','code'],
                 dtype=object)
df.info()

Assigning different datatypes to multiple columns:

df = pd.read_csv('inventory.csv', header=None,
                 names=['model','size','width','color','qty','code'],
                 dtype={'model':object, 'size':float,
                        'width':object, 'color':object,
                        'qty':int, 'code':object})
df.info()

verbose

Displays additional information about the time it takes to parse the file into a dataframe: timings for tokenization, type conversion, and parser memory cleanup.

  • Expects a Boolean value.
  • Default value is False.

Example:

Let’s redo the fighter data import but with the verbose parameter set to True:
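The import is the same one used in the parse_dates example, just with verbose=True added (reusing the w and r converters defined earlier):

fighter = pd.read_csv('raw_fighter_details.csv',
                      converters={'Weight':w, 'Reach':r},
                      header=0,
                      usecols=[0,1,2,3,5],
                      parse_dates=['DOB'],
                      verbose=True)

With verbose=True, pandas prints how long tokenization, type conversion, and parser memory cleanup took while reading the file.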

In my experience, in the hurry of getting things done, I have found myself not reading the documentation unless I am stuck with an issue. I am working on shaking this behavior off as I continue to learn and practice Python. It is of great benefit for my projects and learning process to spend the extra time learning the parameters that come along with our favorite or most used functions.

Thank you for reading!
