Pandas cheatsheet for SQL people (part 1)
Pandas library is the de-facto standard tool for data scientists, nowadays. It is used widely by many data scientists around the globe. After being familiar with it I always use it for processing table-structured data whatever project I am working on. It works fast and reliable, supports CSV, Excel, JSON and so on.
However, as a person experienced in SQL, I had some difficulties and confusion with manipulating the tables (a.k.a. DataFrames) in the beginning. Eventually, I learned more APIs and ways of doing the things properly. I believe many people who do his/her first steps on Pandas may have the same experience. Some time ago I prepared the cheatsheet using SQL queries and their analogy in Pandas. And I am happy to share it with all of you.
In this post, I am sharing the queries using “SELECT”. I will post other queries and their analogies to Pandas soon later. I would be happy to receive your feedback and wish-list regarding the cheatsheet.
Here we go.
We have three simple tables:
- users
- courses
- attendance
And the rest of the queries will be about data manipulations on them.
Lets start from basic SELECT command.
SELECT * FROM users
SELECT * FROM users LIMIT 0,10
SELECT * FROM users WHERE email IS NULL
SELECT first_name, last_name FROM users
SELECT DISTINCT birth_year FROM users
Basic math and arithmetic operations.
SELECT AVG(points) FROM users;
SELECT SUM(points) FROM users;
Conditional operations and LIKE.
SELECT * FROM users WHERE birth_year BETWEEN 1998 AND 2018
SELECT * FROM users WHERE first_name LIKE 'Ch%'
SELECT * FROM users WHERE first_name LIKE '%es'
SELECT * FROM users WHERE first_name LIKE '%on%'
SELECT first_name, last_name FROM users WHERE first_name LIKE '%on%'
I finalize this post with JOIN and ORDER BY related SQL queries. To make the code more simpler and easy to read I will assign the results into new variables and re-use them in next ones.
SELECT * FROM attendance atn
LEFT JOIN users usr ON atn.user_id = usr.id
Now lets join the result above with course titles. The result it will be same as the result of following SQL command.
SELECT * FROM attendance atn
LEFT JOIN users usr ON atn.user_id = usr.id
LEFT JOIN courses co ON co.id = atn.course_id
The query above will return the full table/dataframe with all columns included. To select necessary columns we can use the same method as we did before.
SELECT * FROM users ORDER BY first_name, last_name
SELECT * FROM users ORDER BY first_name, last_name DESC
SELECT first_name, last_name, birth_year,
points, course_name, instructor FROM attendance atn
LEFT JOIN users usr ON atn.user_id = usr.id
LEFT JOIN courses co ON co.id = atn.course_id
ORDER BY first_name, last_name