Parsing Oracle SQL with FlowHigh

Uli Bethke
9 min readAug 10, 2023

This is the second post in a series on SQL parsing. In the first part we went through SQL parsing on Snowflake. In this blog post we will walk you through all the details of parsing SQL on Oracle.

We first explain what an SQL parser is and some common use cases. We then give you hands-on examples of parsing Oracle SQL to detect anti patterns in SQL. We will also share some code that parses Oracle SQL programmatically using the FlowHigh SQL parser SDK. Using the SDK you can automate the parsing of Oracle SQL.

What is an SQL parser and why do you need one?

An SQL parser takes an SQL statement and splits it into its components and elements. This is useful if you want to understand what tables and columns are used (or not used), columns for joins or columns for filters etc. There are dozens of use cases.

I have written a detailed article on the benefits of using an SQL parser on the Sonra blog. There are use cases for both data engineering and data governance.

SQL parser for data engineering

  • Query conversion between SQL dialects or between SQL dialect and ETL tool
  • Visualising complex SQL for documentation and debugging
  • Formatting or linting SQL
  • Detecting bad SQL queries aka as SQL anti patterns
  • Automated testing
  • Impact analysis for data pipelines. How does making a change impact the pipeline?
  • SQL query rewriting

SQL parser for data governance

  • Data lineage
  • Business logic tracing
  • Data protection tracking, e.g. for PII
  • Documentation

Let’s have a look at some real world examples of SQL parsing using FlowHigh.

Oracle SQL Parser in action

FlowHigh, is our potent free online SQL parser. It is an invaluable tool for parsing SQL on Oracle. This SaaS platform provides a user-friendly UI for manual SQL parsing and an SDK for automating the process or handling bulk SQL parsing requirements. We demonstrate FlowHigh’s capabilities in parsing Oracle’s query history. We used the SDK to parse the query history programmatically.

Programmatically parsing the Oracle query history with the FlowHigh SDK

Oracle maintains a detailed log of every SQL statement executed by a user in its query history. This information is accessible through Oracle’s system views and tables.

Oracle’s v$sql and v$sqlarea are dynamic performance views crucial for SQL analysis. The v$sql view offers insights into SQL statements that have been recently parsed or executed, detailing their performance metrics and text. On the other hand, v$sqlarea provides an aggregated perspective, grouping similar SQL statements for a broader overview. Together, they serve as essential tools for understanding SQL execution patterns and behaviors within the Oracle database.

We used the following SQL query to retrieve the necessary information from Oracle’s query history:

SELECT v.SQL_TEXT,
v.SQL_ID
FROM v$sql v

Subsequently, we parsed all the queries retrieved from the query history using the Python code provided below:

import os
import cx_Oracle
from flowhigh.utils.converter import FlowHighSubmissionClass

# Set environment variables
os.environ["LD_LIBRARY_PATH"] = "/home/Downloads/Oracle/instantclient_19_20"

# Connect to the Oracle database
connection = cx_Oracle.connect("system", "password", "localhost/ORCLCDB")
cursor = connection.cursor()

# Specify the query
query = '''
SELECT v.SQL_TEXT,
v.PARSING_SCHEMA_NAME,
v.FIRST_LOAD_TIME,
v.DISK_READS,
v.ROWS_PROCESSED,
v.ELAPSED_TIME,
v.service
FROM v$sql v
'''
sql_insert = """
INSERT INTO C##FH.json_table (query_text, json_data)
VALUES (:1, :2)
"""

# Execute the query
cursor.execute(query)
rows = cursor.fetchall()

# Parse each SQL statement with FlowHigh and insert the JSON output into an Oracle table
for row in rows:
fh = FlowHighSubmissionClass.from_sql(row[0])
js =fh.json_message
cursor.execute(sql_insert, [row[0], js])
connection.commit()

# Close the cursor and the connection
cursor.close()
connection.close()

The FlowHigh SQL parser accepts an SQL statement and returns the parsed output as a structured message, in either JSON or XML format. For instance, the following SQL query: is transformed into a detailed JSON message that breaks down all components of the SQL statement, such as the filter conditions, the fields selected, the aliases used, etc.

SELECT COUNT(1) 
FROM dba_views
WHERE UPPER(text_vc) LIKE '%JSON_DATAGUIDE%'
AND owner NOT IN (SELECT DISTINCT username
FROM all_users
WHERE oracle_maintained='Y')

The resulting XML is also comprehensive and serves as a more compact alternative to the JSON output. For instance, a sample XML output for the SELECT statement is as follows:

<parSeQL version="1.0" status="OK" ts="2023-08-03T15:15:14.040Z" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://flowhigh.sonra.io/flowhigh_v1.0.xsd">
<statements>
<statement pos="0-228">
...
<ds pos="0-228" type="root" subType="inline">
...
<filter xsi:type="filtreg">
<op pos="58-170" type="AND">
<op pos="58-38" type="LIKE">
<func pos="58-14" name="UPPER" subType="string_binary" xsi:type="fscalar">
<attr pos="64-7" dboref="C1"/>
</func>
<const>'%JSON_DATAGUIDE%'</const>
</op>
<op pos="101-127" type="NOT IN">
...
</op>
</op>
</filter>
</ds>
...
</statement>
</statements>
...
</parSeQL>

Here, we see a filter with two conditions combined using an AND operator. The first condition checks if the “text_vc” column contains the word “JSON_DATAGUIDE” after converting to uppercase. The second condition checks if the “owner” is not one of the distinct usernames from the “all_users” table where “oracle_maintained” is ‘Y’. Each of these filter expressions, their positions in the SQL, their types, and their values are precisely represented in the parsed output.

FlowHigh’s SQL parser, combined with the XSD schema, allows us to dissect, understand, and manipulate SQL statements from Oracle’s query history with unprecedented precision and flexibility. Now, as we analyze the results of the parsed queries, we’re able to gain deeper insights into the characteristics and patterns of the SQL statements.

Manually parsing Oracle SQL using the FlowHigh UI

In this section we use the FlowHigh web based user interface to parse SQL. We also apply an SQL parser use case and detect SQL anti patterns. An anti pattern is bad SQL, e.g. overusing SELECT * or specifying a function over a column in a WHERE clause.

FlowHigh works primarily by scanning the input SQL queries and identifying patterns of code that are inefficient, error-prone, or difficult to maintain, referred to as “anti-patterns.” Excessive use of joins, too complex subqueries, and incorrect use of indexes are examples of anti-patterns. The tool can assist database developers and administrators prevent potential performance difficulties, improve code readability, and guarantee that best practices are followed by recognizing such problematic structures early on.

As a result, an anti-pattern detection tool serves not only as a safeguard against frequent SQL programming mistakes, but also as an educational resource to increase general SQL coding proficiency.

Let’s go through an example

We have a query of a regular complexity with 100 lines of code.

WITH all_sales AS (
SELECT
year,
brand_id,
class_id,
category_id,
manufact_id,
SUM(sales_cnt) AS sales_cnt,
SUM(sales_amt) AS sales_amt
FROM
(
SELECT
d.year,
i.brand_id,
i.class_id,
i.category_id,
i.manufact_id,
cs.quantity - COALESCE(cr.return_quantity, 0) AS sales_cnt,
cs.ext_sales_price - COALESCE(cr.return_amount, 0.0) AS sales_amt
FROM
catalog_sales cs
JOIN item i ON i.item_sk = cs.item_sk
JOIN date_dim d ON d.date_sk = cs.sold_date_sk
LEFT JOIN catalog_returns cr ON (
cs.order_number = cr.order_number
AND cs.item_sk = cr.item_sk
)
WHERE
i.category = 'Shoes'
UNION
SELECT
d.year,
i.brand_id,
i.class_id,
i.category_id,
i.manufact_id,
ss.quantity - COALESCE(sr.return_quantity, 0) AS sales_cnt,
ss.ext_sales_price - COALESCE(sr.return_amt, 0.0) AS sales_amt
FROM
store_sales ss
JOIN item i ON i.item_sk = ss.item_sk
JOIN date_dim d ON d.date_sk = ss.sold_date_sk
LEFT JOIN store_returns sr ON (
ss.ticket_number = sr.ticket_number
AND ss.item_sk = sr.item_sk
)
WHERE
i.category = 'Shoes'
UNION
SELECT
d.year,
i.brand_id,
i.class_id,
i.category_id,
i.manufact_id,
ws.quantity - COALESCE(wr.return_quantity, 0) AS sales_cnt,
ws.ext_sales_price - COALESCE(wr.return_amt, 0.0) AS sales_amt
FROM
web_sales ws
JOIN item i ON i.item_sk = ws.item_sk
JOIN date_dim d ON d.date_sk = ws.sold_date_sk
LEFT JOIN web_returns wr ON (
ws.order_number = wr.order_number
AND ws.item_sk = wr.item_sk
)
WHERE
i.category = 'Shoes'
) sales_detail
GROUP BY
year,
brand_id,
class_id,
category_id,
manufact_id
)
SELECT
prev_yr.year AS prev_year,
curr_yr.year AS year,
curr_yr.brand_id,
curr_yr.class_id,
curr_yr.category_id,
curr_yr.manufact_id,
prev_yr.sales_cnt AS prev_yr_cnt,
curr_yr.sales_cnt AS curr_yr_cnt,
curr_yr.sales_cnt - prev_yr.sales_cnt AS sales_cnt_diff,
curr_yr.sales_amt - prev_yr.sales_amt AS sales_amt_diff
FROM
all_sales curr_yr,
all_sales prev_yr
WHERE
curr_yr.brand_id = prev_yr.brand_id
AND curr_yr.class_id = prev_yr.class_id
AND curr_yr.category_id = prev_yr.category_id
AND curr_yr.manufact_id = prev_yr.manufact_id
AND curr_yr.year = 2000
AND prev_yr.year = 2000 - 1
AND CAST(curr_yr.sales_cnt AS DECIMAL(17, 2)) / CAST(prev_yr.sales_cnt AS DECIMAL(17, 2)) < 0.9
ORDER BY
sales_cnt_diff;

Lets login to FlowHigh and go to the Optimiser module. Paste the SQL statement and click “Optimize”.

Get access to Flowhigh.

There may be more than one anti pattern in the same SQL statement.

When you optimize a query, the first antipattern will be opened automatically, but you can also click on an icon in a line to open the antipattern that you want to check.

In the right side you will get a description of the anti pattern with an explanation, the type of anti pattern, the severity and suggestions on how to fix it.

Each antipattern will have one or up to 3 types. There are three different types: Readability, Performance and Correctness.

Readability: Queries that contain antipatterns of this type, tend to be more complex, harder to read and change in the future.

For example: using arithmetic and string operations on NULLable columns could have side effects that you didn’t plan for.

SELECT customer_id
,CONCAT(first_name
,middle_name
,last_name) AS full_name
FROM customer

As the middle_name is unknown for customer_id 2 the concat string operation also results in an unknown value (NULL). We might have expected “edgar uelzenguelz” as the full name for that customer_id

For the string operation we can use the COALESCE function to achieve our objective

SELECT customer_id
,CONCAT(first_name
,' '
,COALESCE(middle_name
,'')
,' '
,last_name) AS full_name
FROM customer

Performance: Queries that contain antipatterns of this type, have parts of queries that can slow down the speed of queries being executed and increase the amount of resources needed for the execution of it.

For example: Whenever a developer uses SELECT *, they commonly pick more columns than they need. Without any justification, the database must free up more system resources. The use of this anti-pattern could cause performance problems, reduce concurrency, and increase costs.

SELECT *
FROM web_sales a
JOIN item b
ON(a.ws_item_sk=b.i_item_sk)

Correctness: Queries that contain antipatterns of this type, have properties that might need additional checks to ensure that queries always return correct results.

For example: If any row of a subquery using NOT IN returns NULL, the entire NOT IN operator will evaluate to either FALSE or UNKNOWN and no records will be returned. NOT IN will not match any rows if the subquery just returns a single row with NULL.

You create a query that returns orders in the orders table without a record in the order_details table

SELECT order_id
FROM orders
WHERE order_id NOT IN(SELECT order_id
FROM order_details);

You might expect to get a table with row 4, but instead you get a message that the query produced no results.

Instead you can rewrite your query using a defensive programming style to check for NULL in the subquery.

SELECT *
FROM orders
WHERE order_id NOT IN(SELECT order_id
FROM order_details
WHERE order_id IS NOT NULL);

When you optimize a query, at the bottom of SQL Editor you will be able to expand a box that contains a list of all antipatterns.

From there you are able to open any antipattern that you want to check.

There is also an option to generate antipatterns as JSON. For that we can go to SDK and click “GO”.

If you want to find out more about FlowHigh contact us or register.

--

--