If you haven’t read my previous article, please click below, that definitely will be helpful :) !
Let’s start !
Case study:
We want to know the number of each categories and the condition is that the wording of “description” and “category”.
Jest take a look at below code. Explain later.
We can find out where description like ‘…..trash….’
LIKE
and ILIKE
queries will help you find relevant descriptions and categories.
%
matches 0 or more characters.
Difference between this two is
ILIKE
don’t need to consider character is upper or lower.
Function for character data type:
- Case study:
Select the first 50 characters of description
when description
starts with the word "I".
The description
column of evanston311
can be very long. You can get the length of a string with the length()
function.
For displaying or quickly reviewing the data, you might want to only display the first few characters. You can use the left()
function to get a specified number of characters at the start of each value.
To indicate that more data is available, concatenate '...'
to the end of any shortened description
. To do this, you can use a CASE WHEN
statement to add '...'
only when the string length is greater than 50.
So I case when to set up condition I want to get is length more than 50 characters, then combine left 50 character with ‘…’, then the result as below.
Case study:
There are almost 150 distinct values of evanston311.category
. But some of these categories are similar, with the form "Main Category - Details". We can get a better sense of what requests are common if we aggregate by the main category.
To do this, create a temporary table recode
mapping distinct category
values to new, standardized
values. Make the standardized
values the part of the category before a dash ('-'
). Extract this value with the split_part()
function:
split_part(string text, delimiter text, field int)
also need to do some additional cleanup of a few cases that don’t fit this pattern.
Then the evanston311
table can be joined to recode
to group requests by the new standardized
category values.
Case study:
Determine whether medium and high priority requests in the evanston311
data are more likely to contain requesters' contact information: an email address or phone number.
- Emails contain an @.
- Phone numbers have the pattern of three characters, dash, three characters, dash, four characters. For example: 555–555–1212.
Use LIKE
to match these patterns. Remember %
matches any number of characters (even 0), and _
matches a single character. Enclosing a pattern in %
(i.e. before and after your pattern) allows you to locate it within other text.
For example, '%___.com%'
would allow you to search for a reference to a website with the top-level domain '.com'
and at least three characters preceding it.
Create and store indicator variables for email and phone in a temporary table. LIKE
produces True or False as a result, but casting a boolean (True or False) as an integer
converts True to 1 and False to 0. This makes the values easier to summarize later.
If you enjoy this article please click the thump up (Keep pressing can send multiple thump up!) and follow me, I’ll keep updating content about data science! Hope you like it.
Reference:
Christina Maimone: EXPLO RATO RY DATA ANALYSIS IN SQL, datacamp