Analyzing Yelp Dataset with SQL

Sanchita Deb
Analytics Vidhya
Published in
7 min readMar 10, 2021
Photo by Markus Spike on Unsplash

SQL is a key cog in a data science professional’s armory. A Analyst needs SQL to handle structured data stored in relational databases

SQL has been there around for decades. The rate at which we are delivering and devouring information is soaring day by day. To make smart decisions based on data, organizations around the world are hiring data professionals like business analysts and data scientists to mine and unearth insights from the vast treasure trove of data.

And one of the most important tools required for this is — you guessed it — SQL!

I recently took up a course in SQL (Structured Query Language) made by University of California, UC Davis from coursers.org. This is a preliminary in the fundamentals of SQL and working with data so that you can begin analyzing it for data science purposes. You will begin to ask the right questions and come up with good answers to deliver valuable insights and important bits of knowledge for your organization. This course begins with the nuts and bolts and accepts you don’t have any information or skills in SQL.

For people who want to learn SQL and haven’t taken this course yet, I highly recommend this course.

Here are the learning objectives from this course

  • Distinguish between use of SQL for data science applications and SQL for more common data management operations.
  • Identify a subset of data needed from a column or set of columns and write a SQL query to limit to those outcome.
  • Understanding how WHERE, BETWEEN, IN, OR, NOT, LIKE, ORDER BY, and GROUP BY works. using the wildcard function to search for more specific or parts of records, including their advantages and disadvantages, and how best to utilize them
  • Using basic math operators, as well as aggregate functions like AVERAGE, COUNT, MAX, MIN, and others to begin analyzing data.
  • Learning how to write subqueries, including their advantages and disadvantages, and when to use them.
  • Identifying and defining several types of JOINs, including the Cartesian join, an inner join, left and right joins, full outer joins, and a self join. Also, using aliases and pre-qualifiers to make your SQL code cleaner and efficient.
  • Manipulating strings, dates, and numeric data using functions to integrate data from different sources into fields with the correct format for analysis.
  • Using Case / When statements to recode a set of data for grouping at a different level (e.g. cities to regions).
  • Identifying organizational, governance, business, and data conditions that indicate use of a join to prepare data for analysis.

The final project for this course was analyzing the public dataset provided by Yelp, a platform for users to provide reviews and rate their interactions with a variety of organizations — businesses, restaurants, health clubs, hospitals, local governmental offices, charitable organizations, etc.

Below is the Entity Diagram (ER Diagram) of the Yelp dataset that we are going to use to analyze the data.

Let me explain you the Entity Relationship Diagram.

Any object, for example, entities, attributes of an entity, relationship sets, and attributes of relationship sets, can be represented with the help of an ER diagram.

  • The Yellow key icons in tables (for example -id in business table) is the primary key of business table
  • The red diamond icons are the Foreign keys of another table (for example business_id in review table is the foreign key of business table .

Note : If you are not familiar with the idea of Primary and Foreign Keys. Check what are [Primary and Foreign Keys]

  • The Dashed-lines running across tables shows the relationship of tables with each other ( for example the business table has one to many relationship with other tables)

Now we know what SQL does and what is a ER diagram, lets try to dive in a little bit deeper and see how we can retrieve data and solve some business problems.

The First thing we do after getting the data is to check whether how many records are there and whether there are any null values in the column which we are going to use.

  1. Finding the total number of records in Business table.
***********SQL CODE**********
SELECT COUNT(*)
FROM Business
*****************************

2. Check if there are any null values in User table

***********SQL CODE**********
SELECT * FROM user WHERE
id is null or
name is null or
review_count is null or
yelping_since is null or
useful is null or
funny is null or
cool is null or
fans is null or
average_stars is null or
compliment_hot is null or
compliment_more is null or
compliment_profile is null or
compliment_cute is null or
compliment_list is null or
compliment_note is null or
compliment_plain is null or
compliment_cool is null or
compliment_funny is null or
compliment_writer is null or
compliment_photos is null
*****************************

3. List the cities with the most reviews in descending order:

***********SQL CODE**********
SELECT city,SUM(review_count)
FROM business
ORDER BY review_count DESC
*****************************
-----------------+-------------------+
| city | SUM (review_count) |
+-----------------+-------------------+
| Las Vegas | 82854 |
| Phoenix | 34503 |
| Toronto | 24113 |
| Scottsdale | 20614 |
| Charlotte | 12523 |
| Henderson | 10871 |
| Tempe | 10504 |
| Pittsburgh | 9798 |
| Montréal | 9448 |
| Chandler | 8112 |
| Mesa | 6875 |
| Gilbert | 6380 |
| Cleveland | 5593 |
| Madison | 5265 |
| Glendale | 4406 |
| Mississauga | 3814 |
| Edinburgh | 2792 |
| Peoria | 2624 |
| North Las Vegas | 2438 |
| Markham | 2352 |
| Champaign | 2029 |
| Stuttgart | 1849 |
| Surprise | 1520 |
| Lakewood | 1465 |
| Goodyear | 1155 |
+-----------------+-------------------+

4. Find the distribution of star ratings to the business in Avon city

***********SQL CODE**********
SELECT stars,COUNT (stars) AS star_count
FROM business
WHERE city='Avon'
GROUP BY 1
*****************************
+-------+------------+
| stars | star_count |
+-------+------------+
| 1.5 | 1 |
| 2.5 | 2 |
| 3.5 | 3 |
| 4.0 | 2 |
| 4.5 | 1 |
| 5.0 | 1 |
+-------+------------+

5.Find the top 3 users based on their total number of reviews:

***********SQL CODE**********
SELECT name,review_count
FROM user
ORDER BY review_count DESC
LIMIT 3
*****************************
+--------+--------------+
| name | review_count |
+--------+--------------+
| Gerald | 2000 |
| Sara | 1629 |
| Yuri | 1339 |
+--------+--------------+

6.Does posing more reviews correlate with more fans?

Not necessarily correlated. Gerald, who has the most fans, only has 253 reviews. Yuri has only 76 fans, but has the 1339 reviews. Some other factors should also be considered.

***********SQL CODE**********
SELECT name, review_count, fans
FROM USER
ORDER BY fans DESC
*****************************
+-----------+--------------+------+
| name | review_count | fans |
+-----------+--------------+------+
| Amy | 609 | 503 |
| Mimi | 968 | 497 |
| Harald | 1153 | 311 |
| Gerald | 2000 | 253 |
| Christine | 930 | 173 |
| Lisa | 813 | 159 |
| Cat | 377 | 133 |
| William | 1215 | 126 |
| Fran | 862 | 124 |
| Lissa | 834 | 120 |
| Mark | 861 | 115 |
| Tiffany | 408 | 111 |
| bernice | 255 | 105 |
| Roanna | 1039 | 104 |
| Angela | 694 | 101 |
| .Hon | 1246 | 101 |
| Ben | 307 | 96 |
| Linda | 584 | 89 |
| Christina | 842 | 85 |
| Jessica | 220 | 84 |
| Greg | 408 | 81 |
| Nieves | 178 | 80 |
| Sui | 754 | 78 |
| Yuri | 1339 | 76 |
| Nicole | 161 | 73 |
+-----------+--------------+------+
(Output limit exceeded, 25 of 10000 total rows shown)

7.Are there more reviews with the word “love” or with the word “hate” in them?

Yes, there are more reviews with word love.

Total reviews with love -1780

Total reviews with Hate — 232

***********SQL CODE**********
SELECT COUNT(CASE WHEN text LIKE '%love%' THEN text END) AS Love_count,
COUNT(CASE WHEN text LIKE '%hate%' THEN text END) AS Hate_count
FROM review
*****************************
+------------+------------+
| count_love | count_hate |
+------------+------------+
| 1780 | 232 |
+------------+------------+

8.Find the top 10 users with the most fans

***********SQL CODE**********  
SELECT name,fans
FROM user
ORDER BY fans DESC
LIMIT 10
*****************************
+-----------+------+
| name | fans |
+-----------+------+
| Amy | 503 |
| Mimi | 497 |
| Harald | 311 |
| Gerald | 253 |
| Christine | 173 |
| Lisa | 159 |
| Cat | 133 |
| William | 126 |
| Fran | 124 |
| Lissa | 120 |
+-----------+------+

We can use these small check discussed above as a starting point to get an idea of our project and reference it when doing in depth analysis later on.

First Step of any analysis is with understanding the relationships between the table and doing the basic checks like :

Are there are any null Values in our data?

Which are the primary and Foreign key in our table?

What is the relationship between tables? Is it one to many or many to many and so on

Checking the head (Top 10) rows of the data

Technical Note

All the above queries is executed in Coursera’s SQL environment. Take up this course to learn more about the dataset .

Link: coursers.org

--

--

Sanchita Deb
Analytics Vidhya

Millions saw the Apple fall, but Newton asked why — A curious learner and a voracious reader