Analyzing Property Sales with the Snowflake ML Contribution Explorer

The Snowflake ML Contribution Explorer behaves differently from the other Snowflake ML functions (forecasting, anomaly detection, and classification) I’ve written about previously. Let’s see how we can use the contribution explorer to understand property sales fluctuations.


To start, I found the NYC Property Sales Data on data.world. It’s a csv file containing property sales data for New York City from November 2018 through October 2019.

To get this data into my Snowflake account, I created a custom role, a warehouse, a database, a schema, an internal stage, and uploaded the file manually to the stage:

-- create a custom role
use role useradmin;
create role data_scientist;

-- grant the role to myself so I can use it
grant role data_scientist to user <my current user>;

-- create a warehouse, database, schema, and grant privileges
use role sysadmin;
create database top_insights_db;
grant usage on database top_insights_db to role data_scientist;
create schema property_sales_analysis;
grant all on schema property_sales_analysis to role data_scientist;
create warehouse data_science_wh with warehouse_size = 'xsmall';
grant usage on warehouse data_science_wh to role data_scientist;

-- continue working with the newly created role
use role data_scientist;
use warehouse data_science_wh;
use database top_insights_db;
use schema property_sales_analysis;

-- create an internal stage
create stage property_sales_stg;

-- manually upload the csv file to the internal stage

Then I created a file format for the csv file, created a target table to store the contents of the csv file using the INFER_SCHEMA() function, and copied the data from the csv file in the internal stage into the target table:

-- create a file format for the csv file
create file format my_csv_format
    type = csv
    parse_header = TRUE
    field_optionally_enclosed_by = '"';

-- create the target table
create table property_sales using template (
    select array_agg(object_construct(*))
    from table(
        infer_schema(
            location => '@property_sales_stg',
            file_format => 'my_csv_format',
            ignore_case => TRUE
        )
    )
);

-- copy the data from the internal stage into the target table
copy into property_sales
    from @property_sales_stg
    file_format = (format_name = 'my_csv_format')
    match_by_column_name = case_insensitive;
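
As a quick sanity check that the load succeeded, a couple of simple queries along these lines can be run (a minimal sketch against the table created above):

-- verify the number of loaded rows and preview the data
select count(*) from property_sales;

select *
from property_sales
limit 10;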

With the prep work done, now comes the fun part: analyzing the data. Here are some of the more interesting columns in the table that I will use for further analysis:

  • Borough: one of New York City’s boroughs (Bronx, Brooklyn, Manhattan, Queens, Staten Island)
  • Neighborhood: the name of the neighborhood. Interesting discovery: both Queens and Staten Island have a neighborhood called Sunnyside. To keep neighborhoods distinct in the analysis, I will therefore prefix each neighborhood name with its borough (see the query after this list).
  • Building class category: here we have values like One Family Dwellings, Two Family Dwellings, Rentals, Warehouses, Religious Facilities, Factories, Commercial Garages, Hospital And Health Facilities, Office Buildings, and so on. From a property sales perspective, it makes no sense to compare sales between these categories. For the analysis, I will work with a single building class, One Family Dwellings.
  • Gross square feet: the size of the property in square feet
  • Building age: the age of the building, calculated as the difference between the current year and the year in which the property was built. Some rows are missing the year in which the property was built, so I will exclude them from the analysis.
  • Sale price: the price at which the property sold
  • Sale date: the date of the sale
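
To illustrate the Sunnyside overlap and preview the derived values described in the list above, a quick exploratory query can be run (a sketch using the column names referenced later in the analysis):

-- neighborhood names that appear in more than one borough (e.g. Sunnyside)
select neighborhood, count(distinct borough) as num_boroughs
from property_sales
group by neighborhood
having count(distinct borough) > 1;

-- preview the borough-prefixed neighborhood and the calculated building age
select
    borough || '-' || neighborhood as borough_neighborhood,
    year(current_date()) - year_built as building_age
from property_sales
where building_class_category = '01 One Family Dwellings'
and year_built > 0
limit 10;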

After the initial data analysis, we are ready to continue exploring.

What is the Contribution Explorer?

According to the Snowflake documentation, the contribution explorer is designed to streamline and improve the process of root cause analysis around changes in observed metrics. Taking values of some metric over time, the contribution explorer analyzes shifts in data with respect to that metric to determine which data segments are driving each shift.

In terms of the property sales we are analyzing, we want to use the contribution explorer to understand which segments (or dimensions) contribute to fluctuations in property sales.

Let’s look at the data. Limiting the data to One Family Dwellings and excluding properties that have no value for the year built, the total sale prices have the following distribution by day:

select sale_date, sum(sale_price) as total_sale_price
from property_sales
where building_class_category = '01 One Family Dwellings'
and year_built > 0
group by sale_date
order by sale_date;
Total property sales by day, limited to One Family Dwellings

The total sale price distribution shows that the property selling market had significantly more sales during June 2019 as compared to the other months. We want to understand what contributes to this fluctuation.
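
To confirm the spike numerically before reaching for the ML function, the same filter can be aggregated by month (a quick sketch; the June 2019 total should stand out from the surrounding months):

-- total sale price per month for One Family Dwellings
select
    date_trunc('month', sale_date) as sale_month,
    sum(sale_price) as total_sale_price
from property_sales
where building_class_category = '01 One Family Dwellings'
and year_built > 0
group by sale_month
order by sale_month;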

To use the contribution explorer, we need data in the following structure:

  • One or more non-negative metrics: our metric is the total sale price.
  • One or more timestamps: this is the property sale date in our analysis.
  • Dimensions that can be used to segment the data: we will use the Borough and the Neighborhood as categorical dimensions and Gross Square Feet and the Building Age as continuous dimensions.
  • Label: this must be a Boolean value that distinguishes between control and test data, where TRUE represents test data and FALSE represents control data. In our analysis, the property sales during June 2019 represent the test data because this is the data we want insights about. The property sales before June 2019 represent control data, which is used to train the machine learning model.

Bringing it all together, let’s construct a query that performs the contribution exploration by calling the TOP_INSIGHTS() Snowflake ML function:

-- prepare input data by selecting from the property sales table
with input_data as (
    select
        -- construct an object that represents categorical dimensions
        {
            'borough': borough,
            'neighborhood': borough || '-' || neighborhood
        } as categorical_dimensions,
        -- construct an object that represents continuous dimensions
        {
            'gross_square_feet': gross_square_feet,
            'building_age': year(current_date()) - year_built
        } as continuous_dimensions,
        -- the sale price represents the measure, must be a float data type
        sale_price::float as sale_price,
        -- the label is TRUE during June 2019 (representing the test data)
        -- and FALSE otherwise (representing the control data)
        case
            when sale_date between '2019-06-01' and '2019-06-30' then TRUE
            else FALSE
        end as label
    from property_sales
    -- filter for One Family Dwellings
    where building_class_category = '01 One Family Dwellings'
    -- rows that have a value for the year built
    and year_built > 0
    -- data up to end June 2019
    and sale_date <= '2019-06-30'
)
select results.*
from input_data,
    -- call the TOP_INSIGHTS() table function
    table(SNOWFLAKE.ML.TOP_INSIGHTS(
        input_data.categorical_dimensions,
        input_data.continuous_dimensions,
        input_data.sale_price,
        input_data.label
    )
    -- partition by 0 to ensure that the tabular input is processed together
    over (partition by 0)
    ) as results
-- order by surprise in descending order
order by results.surprise desc;

Let’s review this query. The WITH statement creates a CTE (common table expression) named input_data that prepares the input data: it constructs the objects representing the categorical and continuous dimensions, the label column, and the sale price cast to a float data type as the measure. Then we call the TOP_INSIGHTS() table function using the input_data CTE.

As opposed to the other Snowflake ML functions (forecasting, anomaly detection, and classification), we don’t have to train a model separately before calling the TOP_INSIGHTS() table function for contribution exploration. The model is trained on the control data and applied to the test data in the background, so we don’t have to perform these steps manually. The function returns the results after applying the model.

Interpreting results from the TOP_INSIGHTS() function

The TOP_INSIGHTS() function finds the most important dimensions in a dataset, builds segments from those dimensions, and then detects which of those segments influenced the metric.

Here is the output of the previous query:

Output of the TOP_INSIGHTS() function

The query sorts the results by the surprise in descending order. The surprise represents the amount by which the test metric differs from the control metric in the segment. In our results, the first line of the output shows the segment where the borough is not Bronx, the building age is greater than 104.5 years, and the neighborhood is not Bayside in Queens.

When we look at the first few lines in the result, we see that the segments with the highest surprise are those where the borough is either Manhattan or not Bronx, the neighborhood is not Bayside in Queens, and the building age is greater than 104.5 years. These appear to be the factors affecting the total property sales the most during the test period, June 2019. It appears that a significant number of older buildings in Manhattan were sold during that time.
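
One way to sanity-check this interpretation is to compare sales in that segment, older one family dwellings in Manhattan, between June 2019 and the preceding months (a rough sketch; the 104.5-year cutoff comes from the TOP_INSIGHTS() output above):

-- compare the test period (June 2019) with the control period for the segment
select
    iff(sale_date between '2019-06-01' and '2019-06-30',
        'June 2019', 'before June 2019') as period,
    count(*) as number_of_sales,
    sum(sale_price) as total_sale_price
from property_sales
where building_class_category = '01 One Family Dwellings'
and year_built > 0
and borough = 'Manhattan'
and year(current_date()) - year_built > 104.5
and sale_date <= '2019-06-30'
group by period
order by period;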

The TOP_INSIGHTS() function returns more output columns in addition to those shown above, as explained in more detail in the Snowflake documentation.

We can tweak the query further by adding more dimensions from the source data set, filtering for different building class categories, labeling different time periods, and so on.
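
For example, assuming the source table also contains a zip_code column (a hypothetical column name used purely for illustration; substitute whatever columns exist in your data), the dimensions object and the filters can be adjusted along these lines:

-- sketch: analyze Two Family Dwellings, label July 2019 as the test period,
-- and add an assumed zip_code column as an extra categorical dimension
with input_data as (
    select
        {
            'borough': borough,
            'neighborhood': borough || '-' || neighborhood,
            'zip_code': zip_code  -- assumed column, adjust to your data
        } as categorical_dimensions,
        {
            'gross_square_feet': gross_square_feet,
            'building_age': year(current_date()) - year_built
        } as continuous_dimensions,
        sale_price::float as sale_price,
        iff(sale_date between '2019-07-01' and '2019-07-31', TRUE, FALSE) as label
    from property_sales
    -- the category value is illustrative; check the distinct values in your data
    where building_class_category = '02 Two Family Dwellings'
    and year_built > 0
    and sale_date <= '2019-07-31'
)
select results.*
from input_data,
    table(SNOWFLAKE.ML.TOP_INSIGHTS(
        input_data.categorical_dimensions,
        input_data.continuous_dimensions,
        input_data.sale_price,
        input_data.label
    ) over (partition by 0)) as results
order by results.surprise desc;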

The business cases where we can use the contribution explorer functionality are numerous, especially when we have many dimensions in a data set where we want to understand fluctuations in any of the measures.

I’m Maja Ferle, Snowflake Data Superhero and a senior consultant at In516ht. You can get in touch with me on LinkedIn or read my latest book, Snowflake Data Engineering.
