Query and Analyze Call Logs with IBM Cloud SQL Query
How to upload data with IBM Aspera and make SQL queries across data
IBM Cloud SQL Query lets you to use standard SQL and Apache Spark SQL functions to query your structured and semi-structured data stored in Cloud Object Storage (COS). It’s a serverless solution that makes it easy to analyze lots of data in COS by pointing to the COS bucket that stores your data. You don’t have to create schemas or do any extra setup, just create a COS bucket, add the data, click a button to set up the SQL Query service, then start analyzing your data using SQL.
In this post, we’ll quickly set up COS and SQL Query, upload some police call logs to COS using IBM Aspera high-speed data transfer, then run some SQL queries over multiple log files to gain insights into police calls spanning a couple years. Let’s get started …
Setting Up COS and SQL Query
To create the SQL Query service, you’ll need to have COS first. In my introduction to analyzing data with SQL Query, I’ve outlined how to set up both COS and SQL Query from the IBM Cloud UI. To give you another flavor of how to set up COS and SQL Query, we’ll create the services using the IBM Cloud Developer Tools CLI.
Easily make SQL queries on your Cloud Object Storage datamedium.com
If you haven’t installed the command line tool, check out the documentation for installation instructions to get started. Once it’s installed, log into your account with:
$ ibmcloud login
You’ll be prompted to enter your email address and password for authentication. Next, target your Cloud Foundry account which connects you to your space on IBM Cloud:
$ ibmcloud target --cf
Next, create the COS and SQL Query services using the following
ibmcloud commands. If you already have COS, you can skip the first command.
$ ibmcloud resource service-instance-create police-calls-cos cloud-object-storage lite global
Now, add SQL Query.
$ ibmcloud resource service-instance-create police-calls-sql sql-query beta us-south
Now, to view all the services that we just created, run:
$ ibmcloud resource service-instances
This will provide you with a list of services tied to your account.
Name Location State Type Tags
police-calls-cos global active service_instance
police-calls-sql us-south active service_instance
Now that we have COS and SQL Query set up, let’s get started by creating a COS bucket to store the data. Use the UI to create a COS bucket. My previous article on SQL Query went over the steps on how to get set up in detail.
Once in the COS UI, click the Create Bucket button, which will open a window for you to choose a name, select the bucket resiliency, and select its location. An important point to remember when naming buckets is that they must be unique across all existing bucket names in COS.
Let’s now move on to the data …
Uploading the Data
The logs that we’ll use for this example are publicly available Seattle Police Department call logs from 2009 to 2018. It’s nine years of data in a 653 MB file. At the moment, COS buckets in every region limit the uploading of files to 200 MB. However, buckets with “Regional” resiliency that are located in “us-south” can leverage Aspera, which will allow you to exceed the file upload limit.
We could load all of the data as a single file with Aspera, but in this example, I’ve split the 653 MB file into smaller files based on the month and year of the calls. This is to simulate how you might store your log files in COS, and it will also make your queries a lot faster if you don’t have to scan the entire dataset when querying it. Also, it provides us with the opportunity to show how to query over numerous files in COS with SQL Query.
Download the file, and afterwards let’s split it up. To help us split the CSV file, we’ll use a command-line tool called csvkit. It comes with a variety of tools that help you scrape CSV data and transform it into other formats. You can install it using Python’s pip package manager:
$ pip install csvkit
Call_Data.csv file includes the dates of each event, I’ve created a folder for each year that will store twelve CSV files according to the month of each call event. To do that, we’ll use csvkit’s
csvgrep command. This command will search in column
ORIGINAL TIME QUEUED, which is the time a call was logged, and look for the year and the month for each call. Using a regular expression, we’ll select the month and year from that column and store the results by month according to the year. Each file will be saved with a format like
1-2018.csv which represents the call log for January 2018.
To automate this process, we’ll use bash as well as the
csvgrep command. This will iterate over our CSV file, create the appropriate files, and place them in the correct folder by year.
This will take a few minutes to run.
Once the files have been saved in their folders, we can move them to your COS bucket with Aspera. In your COS bucket, click on Add Objects then select Aspera high-speed transfer. You need Aspera Connect installed to transfer files.
There is a Download Aspera Installer link to do that. Also, you’ll need to set Aspera to upload files. You can set that by clicking the Transfer options link and turning on Uploads.
Once you’ve saved these options, click Add Objects again, then with Aspera selected, click Add folder. From there, select all your folders from 2009 to 2018. Then just wait until you see the “Completed” message after all your files have been uploaded.
Querying the Log Files
When querying the files in SQL Query, you’ll need the URL for the COS bucket that contains your data. From the COS UI, you can get this URL by clicking on the kebab icon on the right side of the bucket name.
Select the Bucket SQL URL option.
This will open up a window to select the bucket’s URL. Copy it.
Next, you’ll need to go back to your services page and select the SQL Query service. Once selected, you’ll be taken to the service’s page. Click the Open UI button, which will take you to the SQL Query data browser.
One of the simplest ways to test queries and get a snapshot of the results before running more complex SQL queries is to use the data browser.
In the data browser, use your COS bucket’s URL to create a simple query like the following which will give you the headers as well as a snapshot of the data within the January 2018 log file:
SELECT * FROM cos://us-south/your_bucket/2018/1–2018.csv
It’s pretty simple to run a query because you don’t have to set up tables beforehand. All you need to do is dump your data into COS and SQL Query will create the schema and scan the data behind the scenes. If we wanted to run the same query over all of the 2018 data, we just need to use the 2018 prefix, and SQL Query will scan all the files from that year:
SELECT * FROM cos://us-south/police-call-logs/2018/
For combining multiple log files in separate partitions, you can use the
UNION ALL SQL clause to combine multiple files with different prefixes together. For example, if we want the logs for January 2017 and 2018, we’d write:
SELECT * FROM cos://us-south/police-call-logs/2018/1–2018.csv
SELECT * FROM cos://us-south/police-call-logs/2017/1–2017.csv
For more complex queries, we can use some of Spark SQL’s built-in functions to get the call logs for Christmas 2016 and 2017. We’ll use a common table expression (CTE) to combine the data for December 2016 and 2017. Then, we’ll use the column
ORIGINAL TIME QUEUED, which is a string containing a call’s date and time, and convert it to a date and time type that we can use in our queries and place that in a column we defined as
unix_timestamp function converts the date and time like “12/25/2017 08:44:10 PM” into a Unix timestamp
1514234650. The function
from_unixtime then converts the Unix timestamp into an ISO-8601-formatted date with 24-hour clock time “2017–12–25 20:44:10”.
Next, we’ll use the
xmas_calls CTE to list the types of calls that occurred on Christmas for 2016 and 2017 and group them together by call priority to give us a total for each call type. Using the
HAVING SQL clause, we can filter the results to give us only totals over five.
If we wanted to only look for call logs pertaining to traffic incidents, we could also include
`INITIAL CALL TYPE` like "%TRAFFIC%" within the query like:
Another interesting query to run is the ranking of police call logs related to traffic incidents in 2016. You could use the
RANK() OVER() syntax in SQL Query to accomplish this.
The only limit you have with SQL Query is your ability to use SQL. Standard SQL will work for any query, and you can also use Spark SQL functions to fine tune queries even more.
Analyzing data over multiple files in COS is easy using SQL Query. All that’s required is for you to point to the files or COS bucket prefix, and then you can start using standard SQL to gain insights into your data.
For a tutorial on how you could use SQL Query to analyze logs in real-time, take a look at the Big data logs with streaming analytics tutorial in the IBM Cloud docs. If you’re interested in adding some visualization to your log data, you could try my article that looks at using SQL Query, Jupyter Notebooks, and PixieDust.