Explained: Writing CSV files to Excel with Pandas
Easy File handling with Pandas
An Online CSV to an Excel File
Excel remains one of the most popular spreadsheet applications. Fortunately, the Python Pandas library can work with Excel, permitting easy import and export of tabular data.
This tutorial will use a small tennis csv file that I scraped from the ATP tour webpage on former World Number Ones from the past 2 decades.
Which is More Promising: Data Science or Software Engineering? | Data Driven Investor
About a month back, while I was sitting at a café and working on developing a website for a client, I found this woman…
Read a CSV File from an Online Source
To make things a little more interesting, I wanted to kick-off this tutorial with an illustration that the pandas read_csv method is not confined to only taking a CSV file in our local working directory. Rather, a url can be specified to the read_csv method as long as it points to a CSV file.
In this case, anyone would like to follow along with this tutorial, the link to my small tennis dataframe can be found on my github page here.
I simply take this url and assign it to the variable URL as a string. I then pass it to the read_csv method and view the first 5 rows of my Dataframe using the head method. I assign this Dataframe, the variable name, tennis. The steps that I have undertaken are depicted below.
To clean up the scraped data, I use the drop method to remove the irrelevant ‘Unnamed:7’ column from my Dataframe. To make this change permanent, I also need to specify the argument True to the inplace parameter, in addition to providing the axis parameter the string argument ‘columns’.
Read_csv and the encoding parameter
In addition, should the csv file that you are attempting to download possess special characters, for example German Alphabet characters such as Umlauts and ‘ß’, the read_csv method also has an additional encoding parameter which can accept a string argument of ‘uft-8’. By supplying this argument to the enconding parameter, the read_csv method can usually parse these characters, and prevent an error being thrown.
-and a bonus tip
As a pre-requisite, in order to use Pandas to write Dataframe objects to Excel, it is necessary to install 2 libraries. These are the xlrd and openpyxl libraries respectively. For the sake of convenience, these libraries can be installed from the comfort of the Jupyter Notebook, by simply prefixing the command with a ! sign followed by pip install. When this cell is executed, the output will be either ‘Requirement already satisfied’ or the installation will automatically take place.
Creating 2 Dataframes to export to Excel
In order to showcase Pandas capability in exporting Dataframes, I have decided to split the tennis Dataframe, into 2 smaller dataframes.
The first Dataframe, called Top_half, are the players who spent more than 40 weeks as the World Number 1. The second Dataframe, named Bottom half, are the rest of the tennis players who have less than or equal to 40 weeks in the World Number 1 position.
To do this, I simply extracted the ‘Total_Weeks_at_No_1’ Series from the original tennis dataframe, and used the comparison operator to produce a boolean series which indicates whether players were in the world number one position for more, or less than or equal to 40 weeks. I then pass these boolean Series to the original tennis Dataframe to extract, and save them to appropriately named variables, such as Top_half, and Bottom_half as shown in the Jupyter Notebook still below.
From Pandas Dataframe to Excel in 3 Steps
- To begin the process of exporting Pandas Dataframes to excel, it is neccessary to create a ExcelWriter object. This is achieved using the ExcelWriter method which is called directly on the pandas library. Within this method, I specify the name of the Excel file (here, I chose ‘Tennis_players’) and include the .xlsx extension. This step creates the central workbook, that we can then write our Dataframes to.
- Following this, I call the .to_excel method on my 2 dataframes that I created earlier. Within the to_excel method, the first argument to specify is the ExcelWriter object, followed by an ‘optional’ sheetname parameter. I also choose to exclude the index, by setting the index argument from the default of True to False, and supply the ‘utf-8’ argument to the encoding parameter to handle any special characters. I do these steps for both the Top_half and Bottom_half dataframes respectively.
- Finally, now that my Dataframes are queued up to be exported, I call the save method on my ExcelWriter object, which I have assigned to the variable ‘my_excel_file’.
As we are in the Jupyter Notebook, a trick I like to use is the cell magic %ls. This command lists the files in my working directory. As shown, ‘Tennis_players.xlsx’ is now present.
View in Excel
To confirm the export worked as intended, lets open this file in Excel, and view the sheetnames or tabs as well. Everything has worked as expected. The tennis players with more than 40 weeks as World Number 1 are in the Top_half sheetname, while those with less than or equal to 40 weeks at World Number 1 are in the Bottom_half sheet in the ‘Tennis_players.xlsx’ file.
Conveniently, Pandas also automatically bolds the column headers, which give a nice finish to the exported Dataframes.
The original CSV file that I downloaded using a url was read into Pandas as a Dataframe object. Susequently, this Dataframe was split into 2 Dataframes and exported to Excel, with an appropriate file name ‘Tennis_players’ with the extension .xlsx, with the 2 Dataframes assigned to different tabs.