Pandas Practice Series: Top 5 States with 5-Star Businesses

Rohit Arora
CodeX
Published in
3 min readSep 5, 2022

Level: Medium | Company: Yelp

Image by @mailchimp from Unsplash

Problem Description:

Find the top 5 states with the most 5-star businesses. Output the state name and the number of 5-star businesses and order records by the number of 5-star businesses in descending order. If there are ties in the number of businesses, return all the unique states. If two states have the same result, sort them in alphabetical order.

The question and data are taken from Stratascratch, a site that collects essential interview questions and structures them according to their difficulty level and the company it was asked in.

Data Preview:

yelp_business.head()

Output:

Column definitions:

business_id:object
name:object
neighborhood:object
address:object
city:object
state:object
postal_code:object
latitude:float64
longitude:float64
stars:float64
review_count:int64
is_open:int64
categories:object

Solution:

Let us tackle this question in a step-by-step manner.

Given that, we need to return the top 5 states with the highest count of businesses that have 5 star reviews. This question can be broken into parts:

First, filter out the 5-star businesses

Second, count the number of businesses within each state.

We need to find out whether business_id is a primary key for the table, and that can be done by comparing the size of the dataset with that of the unique number of values in business_id, and if the size is the same, it can be considered the primary key.

Now that we know business_id is a primary key, we can count the number of businesses in each state and sort the resulting data accordingly. Now the question says that you have to sort the states in descending order, i.e., the state with the highest number of 5-star businesses should be the first, and the state with the lowest number of 5-star businesses should be the last. And if there is a tie, then the state should be sorted alphabetically or simply in ascending order.

So we get the correct answer, but the last bit can be done in such a way that gives us an opportunity to learn about a pandas function called Rank. Pandas Dataframe.rank() method returns a rank of every respective index of a series passed. The rank is returned on the basis of position after sorting. So here is how we would do it:

Output:

The above snippet gives rank to each row based on the count of businesses in each state, and now we can just keep the rows with a rank less than 5 and sort accordingly.

Some important concepts we went through:

We can sort the data frame based on multiple columns by mentioning an array of column names in the by parameter of the sort_values function. And we can also decide its ordering by setting ascending parameter as True or False.

We also learned how to use rank function to rank our rows based on column values.

That's it for this post, but don't forget practice is the key to mastering the art of problem-solving, so stay tuned for more such content!

And if you are interested in getting your daily dose of ML, DL, and python content in the form of bite-sized chunks so that you can upgrade your knowledge even in your free time, check out my Twitter account👇

--

--