In this post, I will outline a strategy to ‘learn pandas’. For those who are unaware, pandas is the most popular library in the scientific Python ecosystem for doing data analysis. Pandas is capable of many tasks including:
- Reading/writing many different data formats
- Selecting subsets of data
- Calculating across rows and down columns
- Finding and filling missing data
- Applying operations to independent groups within the data
- Reshaping data into different forms
- Combining multiple datasets together
- Advanced time-series functionality
- Visualization through matplotlib and seaborn
Become an Expert
- My book Master Data Analysis with Python is the most comprehensive text on the market to learn data analysis using Python and comes with 300+ exercises and projects.
- Sign-up for the FREE Intro to Pandas class
- Follow me on Twitter @TedPetrou for my daily data science tricks
Although pandas is very capable, it does not provide functionality for the entire data science pipeline. Pandas is typically the intermediate tool used for data exploration and cleaning, sandwiched between data capture and storage on one side and data modeling and prediction on the other.
For a typical data scientist, pandas will play the largest role as the data moves through the pipeline. One way to quantify this is with the Stack Overflow Trends app.
Currently, pandas has more activity on Stack Overflow than any other Python data science library and makes up an astounding 1% of all new questions submitted on the entire site.
Stack Overflow Overuse
From the chart above, we have evidence that many people are using pandas and are also confused by it. I have answered about 400 questions on pandas on Stack Overflow and see firsthand how poorly understood the library is. For all of the greatness that Stack Overflow has bestowed upon programmers, it comes with a significant downside. The instant gratification of finding an answer is a massive inhibitor to working through the documentation and other resources on your own. I think it would be a good idea to dedicate a few weeks each year to not using Stack Overflow.
Step-by-Step Guide to Learning Pandas
To begin, you should not actually have a goal to ‘learn pandas’. While knowing how to execute the operations in the library will be useful, it will not be nearly as beneficial as learning pandas in ways that you would actually use it during a data analysis. You can segment your learning into two distinct categories:
- Learning the pandas library independent of data analysis
- Learning to use pandas as you would during an actual data analysis
The difference between the two is like learning how to saw a few small branches in half versus going out into a forest and sawing down some trees. Let’s summarize these two approaches before getting into more detail.
Learning the Pandas library independent of data analysis: This approach will primarily involve reading, and more importantly, exploring, the official pandas documentation.
Learning to use Pandas as you would during an actual data analysis: This approach involves finding or collecting real-world data and performing an end-to-end data analysis. One of the best places to find data is with Kaggle datasets. This is not the machine learning component of Kaggle, which I would strongly suggest you avoid until you are more comfortable with pandas.
During your journey to learn how to do data analysis with pandas, you should alternate between learning the fundamentals from the documentation and applying them to a real-world dataset. This is very important, as it’s easy to learn just enough pandas to complete most of your tasks, and then to rely on these basics far too heavily when more advanced operations exist.
Begin with the Documentation
If you have never worked with pandas before, but do have an adequate grasp of basic Python, then I suggest beginning with the official pandas documentation. It is extremely thorough and, in its current state, runs to 2,195 pages in its full PDF form. Even with its massive size, the documentation doesn’t actually cover every single operation and certainly doesn’t cover all the different combinations of parameters you can use within pandas functions/methods.
Getting the Most out of the Documentation
To get the most out of the documentation, do not just read it. There are about 15 sections of the documentation that I suggest covering. For each section, create a new Jupyter notebook. Read this blog post from DataCamp if you are unfamiliar with Jupyter notebooks.
Your First Jupyter Notebook
Begin with the section, Intro to Data Structures. Open this page alongside your Jupyter notebook. As you read through the documentation, write the code (don’t copy it) and execute it in the notebook. During the execution of the code, make sure to explore the operations and attempt new ways to use them.
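As a starting point for that first notebook, here is a minimal sketch of the kind of exploration meant here. The DataFrame, its column names, and its values are made up for illustration and do not come from the documentation.

```python
import pandas as pd

# A tiny, made-up DataFrame; the labels and values are illustrative only
df = pd.DataFrame(
    {"title": ["A", "B", "C"], "year": [2001, 2005, 2010], "rating": [7.1, 8.3, 6.4]},
    index=["m1", "m2", "m3"],
)

# The three components of a DataFrame: the index, the columns, and the values
print(df.index)             # row labels
print(df.columns)           # column labels
print(df.to_numpy().shape)  # (3, 3)

# Selecting a single column returns a Series that shares the DataFrame's index
ratings = df["rating"]
print(type(ratings).__name__)
print(ratings.index.equals(df.index))
```

Try changing the index labels or adding a column and see which of the printed components change.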
Continue with the section Indexing and Selecting Data. Make a new Jupyter notebook and again write, execute, and explore the different operations that you learn. Selecting data is one of the most confusing aspects for beginning pandas users to grasp. I wrote a lengthy Stack Overflow post on .iloc which you may want to read for yet another explanation.
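To make the source of that confusion concrete, here is a small sketch contrasting position-based selection with .iloc and label-based selection with .loc. The data is invented for the example, and .loc is included only as a point of comparison.

```python
import pandas as pd

# Made-up data with string row labels so labels and positions are easy to tell apart
df = pd.DataFrame(
    {"year": [2001, 2005, 2010], "rating": [7.1, 8.3, 6.4]},
    index=["m1", "m2", "m3"],
)

# .loc selects by label, .iloc selects by integer position
by_label = df.loc["m2", "rating"]
by_position = df.iloc[1, 1]

# Slices differ: .loc includes the end label, .iloc excludes the end position
label_slice = df.loc["m1":"m2"]   # rows m1 and m2
position_slice = df.iloc[0:2]     # rows at positions 0 and 1

# A boolean condition can be passed to .loc to filter rows
recent = df.loc[df["year"] > 2004, "rating"]
print(recent)
```

Both slices return the same two rows here, which is exactly why the inclusive and exclusive endpoints are easy to mix up.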
After these two sections, you should understand the components of a DataFrame and a Series and know how to select different subsets of data. Now read 10 minutes to pandas to get a broader overview of several other useful operations. As with all sections, make a new notebook.
Press shift + tab + tab to get Help in a Jupyter Notebook
I am constantly pressing shift + tab + tab when using pandas in a Jupyter notebook. When the cursor is placed inside a name, or in the parentheses that follow any valid Python object, the documentation for that object pops out into a little scrollable box. This help box is invaluable to me since it’s impossible to remember all the different parameter names and their input types.
You can also press tab directly following a dot to get a dropdown menu of all the available objects.
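If you are working outside a notebook, roughly the same information is available from plain Python. This is only an approximate equivalent of the keyboard shortcuts, using the standard inspect module and the built-in dir function; fillna is picked arbitrarily as the method to inspect.

```python
import inspect

import pandas as pd

# Roughly what the help box shows: a method's parameters and its docstring
signature = inspect.signature(pd.DataFrame.fillna)
first_doc_line = inspect.getdoc(pd.DataFrame.fillna).splitlines()[0]
print(signature)
print(first_doc_line)

# Roughly what tab after a dot shows: the public names available on an object
df = pd.DataFrame({"a": [1, 2]})
public_names = [name for name in dir(df) if not name.startswith("_")]
print(len(public_names))
```

Calling the built-in help(pd.DataFrame.fillna) prints the full docstring instead of only its first line.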
Major Downside of the Documentation
While the documentation is very thorough, it does not do a good job of teaching how to properly do a data analysis with real data. All the data is contrived or randomly generated. Also, real data analysis will involve multiple pandas operations (sometimes dozens) strung together. You will never get exposure to this from the documentation. The documentation teaches a mechanical approach to learning pandas, where one method is learned in isolation from the others.
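To illustrate what stringing operations together can look like, here is a short sketch that chains several methods into one pipeline. The movies table and its columns are hypothetical, and method chaining is just one common style, not the only way to combine steps.

```python
import pandas as pd

# Hypothetical data: one row per movie, with one missing budget
movies = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Comedy", "Action"],
    "budget": [10.0, None, 30.0, 20.0, 50.0],
    "revenue": [25.0, 40.0, 45.0, 10.0, 120.0],
})

summary = (
    movies
    .dropna(subset=["budget"])                            # drop rows with no budget
    .assign(profit=lambda d: d["revenue"] - d["budget"])  # add a derived column
    .groupby("genre", as_index=False)["profit"].mean()    # average profit per genre
    .sort_values("profit", ascending=False)               # most profitable first
    .reset_index(drop=True)
)
print(summary)
```

Each step on its own is covered in the documentation; the practice comes from deciding which steps to combine and in what order.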
Your First Data Analysis
After these three sections of the documentation, you will be ready for your first exposure to real data. As mentioned previously, I recommend beginning with Kaggle datasets. You can sort by most voted to return the most popular ones such as the TMDB 5000 movie dataset. Download the data and create a new Jupyter notebook on just that dataset. It is unlikely that you will be able to do any advanced data processing at this point, but you should be able to practice what you learned in the three sections of the documentation.
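A first pass over a newly downloaded dataset might look something like the sketch below. To keep it self-contained, it reads a few made-up rows from an in-memory string; the column names are assumptions and are not the actual columns of any Kaggle dataset. With a real download, you would pass the file path to read_csv instead.

```python
import io

import pandas as pd

# Stand-in for a downloaded CSV file; the rows and column names are invented
csv_text = """title,release_year,budget,vote_average
Movie A,2009,237000000,7.2
Movie B,2012,,7.6
Movie C,2015,245000000,6.3
"""
movies = pd.read_csv(io.StringIO(csv_text))

# Get oriented before doing anything else
print(movies.shape)         # number of rows and columns
print(movies.head())        # first few rows
print(movies.dtypes)        # data type of each column
print(movies.isna().sum())  # missing values per column
print(movies.describe())    # summary statistics for numeric columns

# Practice selection: titles and ratings of movies rated above 7
highly_rated = movies.loc[movies["vote_average"] > 7, ["title", "vote_average"]]
print(highly_rated)
```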
Look at the Kernels
Every Kaggle dataset has a kernels section (movie dataset kernels). Don’t let the name ‘kernel’ confuse you: it’s just a Jupyter notebook created by a Kaggle user in Python or R. This will be one of your best learning opportunities. After you have done some basic analysis on your own, open up one of the more popular Python kernels. Read through several of them, take the pieces of code that you find interesting, and insert them into your own notebook.
If you don’t understand something, ask a question in the comments section. You can actually create your own kernel, but for now, I would stick with working locally in your notebooks.
Going Back to the Documentation
Once you have finished your first kernel, you can go back to the documentation and complete another section. Here is my recommended path through the documentation:
- Working with missing data
- Group By: split-apply-combine
- Reshaping and Pivot Tables
- Merge, join, and concatenate
- IO Tools (Text, CSV, HDF5, …)
- Working with Text Data
- Time Series / Date functionality
- Time Deltas
- Categorical Data
- Computational tools
- MultiIndex / Advanced Indexing
This order is significantly different from the order presented on the left-hand side of the home page of the documentation and covers the topics I think are most important first. There are several sections of the documentation that are not listed above, which you can cover on your own at a later date.
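For a taste of what the first four sections on that path cover, here is a compact sketch with one operation from each. The sales and managers tables are invented for illustration, and each section of the documentation goes far beyond the single call shown.

```python
import pandas as pd

# Invented data: quarterly revenue per store, with one missing value
sales = pd.DataFrame({
    "store": ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100.0, None, 80.0, 120.0],
})
managers = pd.DataFrame({"store": ["north", "south"], "manager": ["Ana", "Ben"]})

# Working with missing data: replace the missing revenue with 0
filled = sales.assign(revenue=sales["revenue"].fillna(0))

# Group By: total revenue per store
totals = filled.groupby("store")["revenue"].sum()

# Reshaping: one row per store, one column per quarter
wide = filled.pivot(index="store", columns="quarter", values="revenue")

# Merge: attach each store's manager to its sales rows
merged = filled.merge(managers, on="store", how="left")

print(totals)
print(wide)
print(merged)
```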
After completion of these sections of the documentation and about 10 Kaggle kernels, you should be well on your way to feeling comfortable both with the mechanics of pandas and actual data analysis.
Learning Exploratory Data Analysis
By reading many popular Kaggle kernels, you will learn quite a lot about what makes a good data analysis. For a more formal and rigorous approach, I recommend reading chapter 4, Exploratory Data Analysis, of Howard Seltman’s online book.
Creating your own Kernels
You should consider creating your own kernels on Kaggle. This is an excellent way to force yourself to write clean and clear Jupyter notebooks. It is typical to create notebooks on your own that are very messy, with code written out of order that would be impossible for someone else (like your future self) to make sense of. When you post a kernel online, I would suggest writing it as if you expected your current or future employer to read it. Write an executive summary or abstract at the top and clearly explain each block of code with markdown. What I usually do is make one messy, exploratory notebook and an entirely separate notebook as a final product. Here is a kernel from one of my students on the HR analytics dataset.
Don’t just Learn Pandas; Master it
There is a huge difference between a pandas user who knows just enough to get by and a power user who has it mastered. It is quite common for regular users of pandas to write poor code, as there is quite a substantial amount of functionality and often multiple ways to get the same result. It is quite easy to write some pandas operations that get your result but in a highly inefficient manner.
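As one small illustration of that gap, the sketch below computes the same column two ways: with a row-by-row Python loop and with a single vectorized expression. The data is made up, and this is only one of many patterns where a slower and a faster route give identical results.

```python
import pandas as pd

# Made-up order data
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Gets the result, but loops over the rows one at a time in Python
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# Same result from one vectorized operation on whole columns
totals_vectorized = df["price"] * df["quantity"]

print(totals_loop)
print(totals_vectorized.tolist())
```

On three rows the difference is invisible, but the loop generally becomes much slower than the vectorized version as the number of rows grows.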
If you are a data scientist who works with Python, you probably already use pandas frequently, so making it a priority to master it should create lots of value for you. There are lots of fun tricks available as well.
Test your Knowledge with Stack Overflow
You don’t really know a Python library if you cannot answer the majority of questions on it that are asked on Stack Overflow. This statement might be a little too strong, but in general, Stack Overflow provides a great testing ground for your knowledge of a particular library. There are over 50,000 questions tagged as pandas, so you have an endless test bank to build your pandas knowledge.
If you have never answered a question on Stack Overflow, I would recommend looking at older questions that already have answers and attempting to answer them by only using the documentation. After you feel like you can put together high-quality answers, I would suggest making attempts at unanswered questions. Nothing improved my pandas skills more than answering questions on Stack Overflow.
Your own Projects
Kaggle kernels are great, but eventually, you need to tackle a unique project on your own. The first step is finding data, for which there are many resources, such as:
- NYC open data, Houston open data, Denver open data — most large American cities have open data portals
After finding a dataset you want to explore, continue the same process of creating a Jupyter notebook and when you have a nice final product, post it on GitHub.
In summary, use the documentation to learn the mechanics of pandas operations and use real datasets, beginning with Kaggle kernels, to learn how to use pandas to do data analysis. Finally, test your knowledge with Stack Overflow.