Feature Engineering with Snowflake, Using Snowpark and Scala

In this article I will show you how easy it is to do Feature Engineering with Snowflake using the recently released Snowpark: a DataFrame API that enables data engineers, data scientists, and developers to use their preferred language and familiar programming concepts, and then execute these workloads directly within Snowflake. Snowpark, currently in preview, initially supports Scala/Java and will soon also support Python.

Introduction to Feature Engineering

Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data. Simply put, we want to create new features (columns) for our dataset using the data it already contains. One could also argue that we could create new features by adding additional data, for example by joining existing datasets or using the Snowflake Data Marketplace; that is also true and might be the subject of a future article.

The features we can create based on existing data are very dependent on the data, and the domain we are within.

In this post, I am going to use credit card transactions (based on the Machine Learning for Credit Card Fraud Detection — Practical Handbook) and create new features from that data.

The dataset has very few features, as illustrated in the picture below: the date and time of the transaction, the id of the customer, the terminal that was used, and the amount. It also has an added feature, or rather a label, that flags whether the transaction was fraudulent, and that will be our target when training a machine learning model at a later stage.

Challenges with the existing features

Using only the existing features would make it very hard for a machine learning algorithm to find a pattern that could indicate fraudulent behavior.

My data has three types of features: identifiers, timestamps and amount.

Identifiers do not provide any value, primarily because they usually have a high cardinality, meaning a high number of unique values, and they would therefore only provide value if a specific customer or terminal was responsible for the majority of the frauds.

Date and time are another example of features that do not work well with most machine learning algorithms. This is because the value itself does not carry much meaning without domain knowledge; for example, 2019–04–01 and 2019–04–08 are just two different values. However, both you and I can probably see that they are in the year 2019 and the month of April, and with a calendar we can also see that they are both Mondays. This is because we know that they are dates and know how to read them; a machine learning algorithm does not have this knowledge, so we need to provide that information in another way.

So that leaves me with only amount as a feature that might be useful for identifying fraud, and depending on the algorithm I am going to use I might also need to transform it into additional features, for example bins. In this case I will be using a tree-based algorithm when training my model, which handles numeric values rather well, so I do not need to do that. However, the amount feature is not enough by itself, so I need more features.

A solution

In order to provide date and time information in a way that a machine learning algorithm can understand, I need to convert it into something the algorithm can use. If I think it is important to know what day of the week something happens, I can extract the day of week from a date and store that in a new feature. Other examples are day of month, month number, hour of day and so on. Since all these features have very few values (day of week has seven, hour of day has twenty-four), it gets easier for an algorithm to find a pattern.

For this example I am going to create two new date and time based features, one to indicate if a transaction is done during weekends and another to indicate if it is during night.

In addition, I also want to create some behaviour-based features that can be used to capture changes in a customer’s behaviour that could indicate fraud.

Popular behaviour-based features are RFM (Recency, Frequency and Monetary value), where recency is how long ago the last interaction was, frequency is how often we interact, and monetary value is how much we spent so far. In this example, I am going to focus on frequency and monetary value based features where I will calculate the number of transactions and average amount for the previous one, seven, and thirty days.

In total, I am going to create eight new features for this dataset, as illustrated in the picture below, which should help an algorithm to find patterns in the data that can be used to predict fraud. So let’s start the fun!

Creating new features with Snowpark

Getting started with Snowpark

Before I can start, I need to set up Snowpark in a development environment; depending on whether I am using Jupyter Notebook, IntelliJ, Visual Studio Code or any other IDE that supports Scala, different steps are needed, and this is documented in the Snowpark Developer Guide.

The first thing I need to do is import the Snowpark libraries and connect to my Snowflake account. I am using a config file to provide all the necessary information; you can find details on how to use it, and other options, in the Creating a Session for Snowpark section of the documentation.
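A minimal sketch of what that can look like, assuming the connection details are kept in a properties file (the file name, and everything in it, is a placeholder for your own account details):

```scala
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._

// Create a session using connection properties from a config file
// (file name is an assumption; it holds account, user, role, warehouse, database, schema)
val session = Session.builder.configFile("connection.properties").create
```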

The next step is to define a Snowpark DataFrame, dfCustTrxFraud, that represents our transactions table. A DataFrame represents a relational dataset that is evaluated lazily, meaning it will only execute when a specific action is triggered. You can define a DataFrame based on an existing table/view, a SQL statement or an external file; the documentation has examples of the different ways to define a dataframe. By using the show() function on a DataFrame I ask for 10 rows to be returned and displayed in the client, which triggers the execution of the DataFrame in Snowflake.
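A sketch of those two steps, assuming the transactions are stored in a table named CUSTOMER_TRX_FRAUD (the table name is an assumption):

```scala
// Define a DataFrame on top of the transactions table; nothing is executed yet
val dfCustTrxFraud = session.table("CUSTOMER_TRX_FRAUD")

// show() triggers execution in Snowflake and displays 10 rows in the client
dfCustTrxFraud.show()
```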

Date and time based feature creation

I will start by creating the two date and time based features. In order to do that, I need to extract the day of week and the hour for when a transaction was done. Snowflake has a lot of date & time functions and most of those are exposed in the Snowpark API (I will talk later about how to use the functions that are not exposed). In order to get the day of the week, I am using the dayofweek method, and for getting the hour I am using the hour method.

By using the withColumns function, I can define multiple columns at once by using two Sequences, one for the column names and one for the column definitions.

Finally, I am using the iff function, which is equivalent to an if-then-else expression, to set the value to 1 if it is a weekend or during nighttime, and 0 if it is a weekday or during daytime. The lit function is used for setting the 1 or 0.
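Putting those pieces together could look something like the sketch below. The TX_DATETIME column name comes from the dataset, while the exact weekend/night definitions (Saturday/Sunday, and hours between midnight and 06:00) and the default WEEK_START setting (Sunday = 0, Saturday = 6) are assumptions:

```scala
val dfDateTimeFeat = dfCustTrxFraud.withColumns(
  Seq("TX_DURING_WEEKEND", "TX_DURING_NIGHT"),
  Seq(
    // 1 if the transaction happens on a Saturday or Sunday, 0 otherwise
    iff((dayofweek(col("TX_DATETIME")) === lit(6))
          .or(dayofweek(col("TX_DATETIME")) === lit(0)), lit(1), lit(0)),
    // 1 if the transaction happens between midnight and 06:00, 0 otherwise
    iff(hour(col("TX_DATETIME")) < lit(6), lit(1), lit(0))
  )
)
```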

The return is a new DataFrame object that now has two new columns, TX_DURING_WEEKEND and TX_DURING_NIGHT.

Even if I have now added two new columns to the dfDateTimeFeat dataframe, I have not changed the underlying table. The columns only exist as part of the SQL that defines the dataframe.

Behaviour based features creation

The next step is to create the behaviour-based features, which is not as straightforward as the date and time features, but what is life without some challenges?

What I want to do is calculate the number of transactions and the average amount at a day level, but store them at a transaction level. The calculations are to be based on the previous one, seven and thirty days, also including the transactions done so far during the current day of a transaction.

Below is an example of how I want to count the number of transactions and the average amount for the previous one day; as you can see, I am doing a rolling calculation where I also include the current transaction.

Snowflake has window functions that allow operations on a group of rows, letting you perform rolling operations, such as calculating a running total or a moving average, on a subset of the rows in the window.

Since a window is either a fixed number of rows or a range of rows that have the same value as the current row, I need to have my data on the level I am going to base my windows on, which in this case is day. I also want the windows to be a complete sequence of days; for example, the previous seven days for 2019–04–08 should include 2019–04–07, 2019–04–06, 2019–04–05, 2019–04–04, 2019–04–03, 2019–04–02 and 2019–04–01, and if a customer does not have transactions on a specific day the values should be zero.

In order to generate those calculations, I need to divide the work into multiple steps, and that is something that is easy to do using a programming language and the Snowpark API.

  1. Generate a dataframe with one row for each customer and day between the minimum and maximum date in our data.
  2. Calculate the number of transactions and the amount by customer and day, adding zeros for the days a customer has no transactions.
  3. Use Snowflake window functions to calculate the number of transactions and average amount during the previous one, seven and thirty days, excluding the current day.
  4. Join it with the transactions and also calculate the transactions and amount during the current day.

In order to define the dataframe that returns one row for each customer and day between the minimum and maximum date in the transactions, I first have to define additional dataframes, one with all dates for the interval and another with all customers, and join those.

To be able to generate a dataframe, dfDays, with all the dates, I need to calculate how many days I need to generate dates for. In order to do that I have to get the first and the last date of the transactions and the number of days between them.

Using the min and max functions on the TX_DATETIME column in the dfCustTrxFraud dataframe gets me the start and end dates of the transactions, and by providing them as inputs to datediff I get the number of days between them. The select method is used to select which columns to return, and the collect() method executes the SQL generated by these operations and pulls the result back to my client as an array of Rows, which I store in the dateInfo variable.

I can now extract the values of the columns by referring to them directly, using the row and column index, and use a getter function to get the values with the correct data type.
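A sketch of those two steps; the column aliases are my own, and which getter to use for the day count depends on the data type Snowpark maps the result to, so treat getLong as an assumption:

```scala
// Start date, end date and the number of days between them, collected to the client
val dateInfo = dfCustTrxFraud
  .select(
    min(col("TX_DATETIME")).as("MIN_TX_DATETIME"),
    max(col("TX_DATETIME")).as("MAX_TX_DATETIME"),
    datediff("DAY", min(col("TX_DATETIME")), max(col("TX_DATETIME"))).as("N_DAYS"))
  .collect()

// dateInfo is an Array[Row]; pick out the values with typed getters
val dStartDate = dateInfo(0).getTimestamp(0) // first transaction timestamp
val nDays      = dateInfo(0).getLong(2)      // days between first and last transaction
```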

Now I can define a new dataframe, dfDays, that will return one row for each day that is between the min and max dates in our transactions.

The Snowpark API has a method called range that can be used to generate a fixed number of rows; in this case the number of rows is the number of days between the min and max dates, which we now have in our local variable nDays.

The range function returns rows that by default only have an ID column, and I need a column with a date that I can use for my aggregations. The dateadd function can be used to generate dates: it adds a number of units of the specified part to a date, and I want to increase dStartDate by one day for each row. By using the SEQ4 function, which generates a sequence of monotonically increasing integers starting with zero, as the number of units to add, I can do exactly that.

SEQ4 is not exposed in the Snowpark API, which is not a problem since Snowpark has functions for calling Snowflake built-in functions that are not exposed. callBuiltin is one function that allows me to call a function in Snowflake that is not exposed in the Snowpark API.

Since dateadd returns a timestamp, I also use to_date in order to return the date part of it. The end result is illustrated in the picture below.
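Putting that together, a sketch of the dfDays dataframe could look like this (dStartDate and nDays come from the previous step):

```scala
// One row per day, starting at dStartDate: range() gives nDays rows and
// SEQ4 gives 0, 1, 2, ... which is added as days to dStartDate
val dfDays = session.range(nDays)
  .withColumn("TX_DATE",
    to_date(dateadd("DAY", callBuiltin("SEQ4"), lit(dStartDate))))
```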

I also want to add the customer id so that each customer has one row for each day. In order to do that I first need to get all customer ids into a dataframe, dfCustomers, as illustrated below.

By doing a cross join between dfCustomers and dfDays dataframes, I get a new dataframe, dfCustDay, with one row for each customer id and date, as illustrated below.

Below is the code for doing these steps.
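A sketch of what that code can look like, assuming the customer id column is named CUSTOMER_ID:

```scala
// All unique customer ids
val dfCustomers = dfCustTrxFraud.select(col("CUSTOMER_ID")).distinct()

// One row for each combination of customer id and date
val dfCustDay = dfCustomers.crossJoin(dfDays)
```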

By joining the dfCustDay dataframe with the dfCustTrxFraud dataframe, I can summarise the number of transactions and the amount by customer and day to define a new dataframe, dfCustTrxByDay. For the dates a customer has no transactions, NULL will be returned, and by using the ZEROIFNULL function in Snowflake they can be replaced with zero instead.

ZEROIFNULL is also not exposed by the Snowpark API and I could use the callBuiltin function to call it. I can also assign the function call to a variable using the builtin function and then use that variable in the code, making it easier to read.
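A sketch of that step, assuming the amount column is named TX_AMOUNT (the day-level column names NO_TRX_DAY and TOT_AMOUNT_DAY are also my own):

```scala
// Wrap the ZEROIFNULL built-in so it can be used like any other Snowpark function
val zeroifnull = builtin("ZEROIFNULL")

// Left outer join so days without transactions are kept, then aggregate per customer and day
val dfCustTrxByDay = dfCustDay
  .join(dfCustTrxFraud,
        (dfCustDay("CUSTOMER_ID") === dfCustTrxFraud("CUSTOMER_ID"))
          .and(dfCustDay("TX_DATE") === to_date(dfCustTrxFraud("TX_DATETIME"))),
        "leftouter")
  .groupBy(dfCustDay("CUSTOMER_ID"), dfCustDay("TX_DATE"))
  .agg(
    count(dfCustTrxFraud("TX_AMOUNT")).as("NO_TRX_DAY"),                // 0 when no transactions
    zeroifnull(sum(dfCustTrxFraud("TX_AMOUNT"))).as("TOT_AMOUNT_DAY"))  // NULL replaced with 0
```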

Now I am ready to calculate the number of transactions and the average amount for the previous one, seven, and thirty days. I can use Snowflake's window functions for this since the data, dfCustTrxByDay, is at the day level.

I start by creating one variable using the Snowpark API WindowSpec object that defines the partitionBy and orderBy sub-clauses. The partitionBy sub-clause allows me to divide a window into sub-windows; in this case I want to do the calculation by customer, so I use the customer id. The orderBy sub-clause specifies how each sub-window should be ordered; in this case I want to order them by date.

For the previous-one-day calculations I will use the lag function, since I only want the values of the previous row, which is also the previous day for a customer. For the seven and thirty day windows, I use the rowsBetween sub-clause to specify how many rows, i.e. days, I want to include in the windows.

Once I have the window variables, I use them in combination with the sum and lag functions to do the calculations and add the results as new columns, creating the dfCustFeatDay dataframe.
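A sketch of those calculations; the new column names are my own, and the averages can later be derived by dividing the amount sums by the transaction counts:

```scala
// Per-customer window ordered by day
val custDayWindow = Window.partitionBy(col("CUSTOMER_ID")).orderBy(col("TX_DATE"))

// Previous seven / thirty days, excluding the current day
val window7Days  = custDayWindow.rowsBetween(-7, -1)
val window30Days = custDayWindow.rowsBetween(-30, -1)

val dfCustFeatDay = dfCustTrxByDay.withColumns(
  Seq("NO_TRX_PREV_1_DAY",   "TOT_AMOUNT_PREV_1_DAY",
      "NO_TRX_PREV_7_DAYS",  "TOT_AMOUNT_PREV_7_DAYS",
      "NO_TRX_PREV_30_DAYS", "TOT_AMOUNT_PREV_30_DAYS"),
  Seq(
    lag(col("NO_TRX_DAY"), 1).over(custDayWindow),      // previous day
    lag(col("TOT_AMOUNT_DAY"), 1).over(custDayWindow),
    sum(col("NO_TRX_DAY")).over(window7Days),           // previous 7 days
    sum(col("TOT_AMOUNT_DAY")).over(window7Days),
    sum(col("NO_TRX_DAY")).over(window30Days),          // previous 30 days
    sum(col("TOT_AMOUNT_DAY")).over(window30Days)))
```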

Lastly, I am joining the dfCustFeatDay dataframe with the dfDateTimeFeat dataframe to create a final dataframe, dfCustBehaviurFeat, with all features in it.

Since I have not included the current day transactions in my previous calculations, I need to do it now. The way to do that is by using the window functions.

This time I only want to get the transactions before the current transaction on the same day as the current transaction. For that, I will use rangeBetween, which will include all rows that have the same value for the partitionBy part. Since I want to include all transactions that are for the same day and the same customer, I need to create a new column, PARTITION_KEY, where I concatenate the customer id and the transaction date.

As noted in the code, I am referring to the PARTITION_KEY column without having defined it yet, a benefit of lazy execution since the column only needs to exist when the Window object is used.

I join dfCustFeatDay with dfDateTimeFeat based on date and customer id, and the result is that every transaction row in dfDateTimeFeat will have a row from dfCustFeatDay for the date the transaction is for. The withColumn part creates the PARTITION_KEY, and the withColumns calculates the number of transactions and the total amount for the day the transaction row is for.

The select is used to pick the columns I want to use from the dataframes I am joining and also add the rolling sums to the existing calculations.
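A sketch of how those pieces can fit together. The join keys and the TX_AMOUNT/TX_FRAUD columns are assumptions about the dataset, and only the one-day output columns are shown; the seven and thirty day features follow the same pattern:

```scala
// All transactions so far during the same customer/day, up to and including the current one
val curDayWindow = Window.partitionBy(col("PARTITION_KEY"))
  .orderBy(col("TX_DATETIME"))
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

val dfCustBehaviurFeat = dfDateTimeFeat
  .join(dfCustFeatDay,
        (dfDateTimeFeat("CUSTOMER_ID") === dfCustFeatDay("CUSTOMER_ID"))
          .and(to_date(dfDateTimeFeat("TX_DATETIME")) === dfCustFeatDay("TX_DATE")))
  // PARTITION_KEY = customer id + transaction date, used by the window above
  .withColumn("PARTITION_KEY",
        concat(to_varchar(dfDateTimeFeat("CUSTOMER_ID")),
               to_varchar(to_date(dfDateTimeFeat("TX_DATETIME")))))
  // Transactions and amount so far during the current day (including this transaction)
  .withColumns(
        Seq("CUR_DAY_TRX", "CUR_DAY_AMOUNT"),
        Seq(count(dfDateTimeFeat("TX_DATETIME")).over(curDayWindow),
            sum(dfDateTimeFeat("TX_AMOUNT")).over(curDayWindow)))
  // Keep the original and date/time columns and add the current-day values to the rolling sums
  .select(
        dfDateTimeFeat("TX_DATETIME"),
        dfDateTimeFeat("CUSTOMER_ID"),
        dfDateTimeFeat("TX_AMOUNT"),
        dfDateTimeFeat("TX_FRAUD"),
        dfDateTimeFeat("TX_DURING_WEEKEND"),
        dfDateTimeFeat("TX_DURING_NIGHT"),
        (col("NO_TRX_PREV_1_DAY") + col("CUR_DAY_TRX")).as("NO_TRX_1_DAY"),
        ((col("TOT_AMOUNT_PREV_1_DAY") + col("CUR_DAY_AMOUNT"))
           / (col("NO_TRX_PREV_1_DAY") + col("CUR_DAY_TRX"))).as("AVG_AMOUNT_1_DAY"))
```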

I now have a dataframe, dfCustBehaviurFeat, with all my new features as well as the initial features. It now holds the execution plan/SQL needed to do all the steps I have done in this article, and I can view it using the explain method.
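For example, a one-line check of the generated plan (assuming the dataframe from the sketches above):

```scala
// Prints the SQL query and execution plan that Snowflake will run for this dataframe
dfCustBehaviurFeat.explain()
```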

Saving the new features

So far all of this only exists as the execution plan/SQL of my dataframe, and I want to store the output of it as a new table in Snowflake so I can use it to train a model later on.

In order to do that, I use the write function with the saveAsTable method, which will create a table and store the output of the dataframe's logic in it. By setting mode to overwrite, the table will be dropped and recreated if it already exists.
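A sketch of that step; the target table name is an assumption:

```scala
// Executes the dataframe's SQL in Snowflake and stores the result in a table,
// dropping and recreating the table if it already exists
dfCustBehaviurFeat.write.mode(SaveMode.Overwrite).saveAsTable("CUSTOMER_TRX_FRAUD_FEATURES")
```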

If I had skipped the use of the show method, this would be the step that triggers execution within Snowflake (apart from the step where I collected the dates) and where I actually use any compute.

Conclusion

With this example, I hope that I have shown you how easy it is to do feature engineering with Snowflake by using a programmatic approach and the Snowpark API.

By using the Snowpark API, I keep the data in Snowflake, avoiding unnecessary data movement, and execute the logic using Snowflake compute (which I can easily scale up/down/in/out), all without the need to create a complex infrastructure for handling my feature engineering pipeline.
