How to implement TPC-H queries with GreenplumPython

Ruxue Zeng

Published in

Greenplum Data Clinics

6 min readMar 6, 2023

A quick demonstration and examples

TPCH benchmark

TPC-H is a benchmark developed to evaluate the performance of large-scale SQL and relational databases by the execution of sets of queries. It has 22 queries against a standard database under controlled conditions. These queries:

Give answers to real-world business questions
Are far more complex than most OLTP transactions
Include a rich breadth of operators and selectivity constraints
Generate intensive activity on the part of the database server component of the system under test

Data overview

TPC-H can generate data within eight tables. The data volume is defined by Scale Factor (SF). This blog uses 1 Gb of raw data, i.e. SF=1.

Queries execution

The 22 queries included in TPC-H demonstrate six common choke points to a database:

Aggregation Performance
Join Performance
Data Access Locality
Expression Calculation
Correlated Subqueries
Parallel Execution

This blog has chosen five queries among 22 to show the expressiveness of GreenplumPython.

Preparing the environment

# Database Connection
import greenplumpython as gp
db = gp.database(uri="postgresql://localhost/gpadmin")

# Create GreenplumPython DataFrame from an existing table
customer = db.create_dataframe(table_name="customer")
lineitem = db.create_dataframe(table_name="lineitem")
nation = db.create_dataframe(table_name="nation")
orders = db.create_dataframe(table_name="orders")
part = db.create_dataframe(table_name="part")
partsupp = db.create_dataframe(table_name="partsupp")
region = db.create_dataframe(table_name="region")
supplier = db.create_dataframe(table_name="supplier")

# Access to GreenplumPython builtin functions
import greenplumpython.builtins.functions as F

Q1: Pricing Summary Report Query

This query reports the amount of business billed, shipped, and returned. The date is within 90 days from the ship date in the database.

SQL

select
 l_returnflag,
 l_linestatus,
 sum(l_quantity) as sum_qty,
 sum(l_extendedprice) as sum_base_price,
 sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
 sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
 avg(l_quantity) as avg_qty,
 avg(l_extendedprice) as avg_price,
 avg(l_discount) as avg_disc,
 count(*) as count_order
from
 lineitem
where
 l_shipdate <= date '1998-12-01' - interval '90 days'
group by
 l_returnflag,
 l_linestatus
order by
 l_returnflag,
 l_linestatus;

GreenplumPython

It involves many aggregations; this can be easily translated GreenplumPython using its built-in functions.

q1_gp = (
    lineitem
    .where(lambda t: t["l_shipdate"] <= datetime(1998, 12, 1) - timedelta(days=90))
    .group_by("l_returnflag", "l_linestatus")
    .assign(
        sum_qty=lambda t: F.sum(t["l_quantity"]),
        sum_base_price=lambda t: F.sum(t["l_extendedprice"]),
        sum_disc_price=lambda t: F.sum(t["l_extendedprice"] * (-t["l_discount"] + 1)),
        sum_charge=lambda t: F.sum(t["l_extendedprice"] * (-t["l_discount"] + 1) * (t["l_tax"] + 1)),
        avg_qty=lambda t: F.avg(t["l_quantity"]),
        avg_price=lambda t: F.avg(t["l_extendedprice"]),
        avg_disc=lambda t: F.avg(t["l_discount"]),
        count_order=lambda _: F.count(),
    )
    .order_by("l_returnflag")
    .order_by("l_linestatus")[:]
)
q1_gp

Q2: Minimum Cost Supplier Query

This query finds which supplier should be selected to order all parts with type like “%BRASS” and size = 4 in Asia.

SQL

select
    s_acctbal,
    s_name,
    n_name,
    p_partkey,
    p_mfgr,
    s_address,
    s_phone,
    s_comment
from
    part,
    supplier,
    partsupp,
    nation,
    region
where
    p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and p_size = 4
    and p_type like '%BRASS'
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'ASIA'
    and ps_supplycost = (
        select min(ps_supplycost)
        from
            partsupp,
            supplier,
            nation,
            region
        where
            p_partkey = ps_partkey
            and s_suppkey = ps_suppkey
            and s_nationkey = n_nationkey
            and n_regionkey = r_regionkey
            and r_name = 'ASIA'
    )
order by
    s_acctbal desc,
    n_name,
    s_name,
    p_partkey
LIMIT 5;

GreenplumPython

It involves many tables to be joined together; furthermore, these joins appeared not once. To simplify the query, we can break it into two parts with GreenplumPython: first calculates all supply cost, then use it to find the minimum cost for each part.

supply_cost = (
    part
    .where(lambda t: (t["p_size"] == 4) & (t["p_type"].like(r"%BRASS")))
    .join(partsupp, cond=lambda s, o: s["p_partkey"] == o["ps_partkey"])
    .join(supplier, cond=lambda s, o: s["ps_suppkey"] == o["s_suppkey"])
    .join(nation, cond=lambda s, o: s["s_nationkey"] == o["n_nationkey"])
    .join(
        region[lambda t: t["r_name"] == "ASIA"],
        cond=lambda s, o: s["n_regionkey"] == o["r_regionkey"],
    )
)
q2_gp = (
    supply_cost.join(
        supply_cost.group_by("p_partkey").assign(
            min_supplycost=lambda t: F.min(t["ps_supplycost"]),
        ),
        on="p_partkey",
        other_columns={"min_supplycost"},
    )
    .where(lambda t: t["ps_supplycost"] == t["min_supplycost"])[
        [
            "s_acctbal",
            "s_name",
            "n_name",
            "p_partkey",
            "p_mfgr",
            "s_address",
            "s_phone",
            "s_comment",
        ]
    ]
    .order_by("s_acctbal", ascending=False)
    .order_by("n_name")
    .order_by("s_name")
    .order_by("p_partkey")[:5]
)
q2_gp

Q11: Important Stock Identification

This query finds the most important subset of suppliers’ stock in Japan.

SQL

select
 ps_partkey,
 sum(ps_supplycost * ps_availqty) as value
from
 partsupp,
 supplier,
 nation
where
 ps_suppkey = s_suppkey
 and s_nationkey = n_nationkey
 and n_name = 'JAPAN'
group by
 ps_partkey having
  sum(ps_supplycost * ps_availqty) > (
   select
    sum(ps_supplycost * ps_availqty) * 0.0001000000
   from
    partsupp,
    supplier,
    nation
   where
    ps_suppkey = s_suppkey
    and s_nationkey = n_nationkey
    and n_name = 'JAPAN'
  )
order by
 value desc
LIMIT 5;

GreenplumPython

It involves complex grouping and many joins. To simplify the query, we can also break it into two parts with GreenplumPython: firstly, scan the available stock of suppliers in Japan, then find all the parts that represent 0.01% of the total value of all available parts.

supply_value = (
    partsupp
    .join(supplier, cond=lambda s, o: s["ps_suppkey"] == o["s_suppkey"])
    .join(
        nation[lambda t: t["n_name"] == "JAPAN"],
        cond=lambda s, o: s["s_nationkey"] == o["n_nationkey"],
    )
    .group_by("ps_partkey")
    .assign(value=lambda t: F.sum(t["ps_supplycost"] * t["ps_availqty"]))
)
q11_gp = (
    supply_value.join(
        supply_value.group_by().assign(value=lambda t: F.sum(t["value"])),
        cond=lambda s, o: s["value"] > o["value"] * 0.0001,
        other_columns={},
    )
    .order_by("value", ascending=False)[:5]
)
q11_gp

Q12: Shipping Modes and Order Priority Query

This query determines whether selecting less expensive modes of shipping is negatively affecting the critical-priority orders by causing more parts to be received by customers after the committed date (1994–01–01).

SQL

select
 l_shipmode,
 sum(case
  when o_orderpriority = '1-URGENT'
   or o_orderpriority = '2-HIGH'
   then 1
  else 0
 end) as high_line_count,
 sum(case
  when o_orderpriority <> '1-URGENT'
   and o_orderpriority <> '2-HIGH'
   then 1
  else 0
 end) as low_line_count
from
 orders,
 lineitem
where
 o_orderkey = l_orderkey
 and l_shipmode in ('TRUCK', 'MAIL')
 and l_commitdate < l_receiptdate
 and l_shipdate < l_commitdate
 and l_receiptdate >= date '1994-01-01'
 and l_receiptdate < date '1994-01-01' + interval '1 year'
group by
 l_shipmode
order by
 l_shipmode;

GreenplumPython

It involves complex functions. With GreenplumPython, we can define a class to count high and low lines and create an aggregate function to calculate the count.

# Define return type of aggregate function
class PriorityCount:
    high_line_count: int
    low_line_count: int

# Create aggregate function to count priority of high and low line
@gp.create_aggregate
def count_priority(count: PriorityCount, priority: str) -> PriorityCount:
    res = {
        "high_line_count": 0 if count is None else count["high_line_count"],
        "low_line_count": 0 if count is None else count["low_line_count"],
    }
    if priority == "1-URGENT" or priority == "2-HIGH":
        res["high_line_count"] += 1
    else:
        res["low_line_count"] += 1
    return res

After creating these elements, we can do a simple filtration and application.

text_type = gp.type_("text")

q12_gp = (
    lineitem
    .where(
        lambda t: ((t["l_shipmode"] == "TRUCK") | (t["l_shipmode"] == "MAIL"))
        & (t["l_commitdate"] < t["l_receiptdate"])
        & (t["l_shipdate"] < t["l_commitdate"])
        & (t["l_receiptdate"] >= datetime(1994, 1, 1))
        & (t["l_receiptdate"] < datetime(1994 + 1, 1, 1))
    )
    .join(orders, cond=lambda s, o: s["l_orderkey"] == o["o_orderkey"])
    .group_by("l_shipmode")
    .apply(lambda t: count_priority(text_type(t["o_orderpriority"])), expand=True)
    .order_by("l_shipmode")[:]
)

q12_gp

Q20: Potential Part Promotion Query

This query identifies suppliers in the United Kingdom having parts like “metallic%” that may be candidates for a promotional offer.

SQL

select
 s_name,
 s_address
from
 supplier,
 nation
where
 s_suppkey in (
  select
   ps_suppkey
  from
   partsupp
  where
   ps_partkey in (
    select
     p_partkey
    from
     part
    where
     p_name like 'metallic%'
   )
   and ps_availqty > (
    select
     0.5 * sum(l_quantity)
    from
     lineitem
    where
     l_partkey = ps_partkey
     and l_suppkey = ps_suppkey
     and l_shipdate >= date '1995-01-01'
     and l_shipdate < date '1995-01-01' + interval '1 year'
   )
 )
 and s_nationkey = n_nationkey
 and n_name = 'UNITED KINGDOM'
order by
 s_name
LIMIT 10;

GreenplumPython

It involves queries with nested subqueries. With GreenplumPython, we can use in_ operator to carry this query type.

q20_gp = (
    supplier.join(
        nation,
        cond=lambda s, o: (o["n_nationkey"] == s["s_nationkey"])
        & (o["n_name"] == "UNITED KINGDOM"),
    )[
        lambda t: t["s_suppkey"].in_(
            partsupp
                .join(
                    lineitem
                    .where(
                        lambda t: (t["l_shipdate"] >= datetime(1995, 1, 1))
                        & (t["l_shipdate"] < datetime(1995 + 1, 1, 1))
                    )
                    .group_by("l_partkey", "l_suppkey")
                    .assign(sum=lambda t: F.sum(t["l_quantity"])),
                    cond=lambda s, o: (s["ps_partkey"] == o["l_partkey"])
                    & (s["ps_suppkey"] == o["l_suppkey"])
                    & (s["ps_availqty"] > o["sum"] * 0.5),
                )
                [
                lambda s: s["ps_partkey"].in_(
                        part[lambda t: t["p_name"].like(r"metallic%")]["p_partkey"]
                    )
                ]["ps_suppkey"]
        )
    ]
    [["s_name", "s_address"]]
    .order_by("s_name")[:10]
)
q20_gp

Conclusion

This blog is not trying to evaluate the performance of database systems but to show the expressiveness of Greenplumpython. It is pretty easy to translate the query into Python syntax with GreenplumPython; you can break a complex SQL query down into several simple Python commands.

NOTE: This blog does a “literal” translation from SQL to GreenplumPython; there would be many changes in an actual implementation case to achieve satisfactory performance.

How to implement TPC-H queries with GreenplumPython

TPCH benchmark

Data overview

Queries execution

Preparing the environment

Q1: Pricing Summary Report Query

Q2: Minimum Cost Supplier Query

Q11: Important Stock Identification

Q12: Shipping Modes and Order Priority Query

Q20: Potential Part Promotion Query

Conclusion

More resources

Written by Ruxue Zeng